Kilim a server framework with lightweight actors isolation types zero-copy messaging sriram srinivasan This dissertation is submitted for the degree of Doctor of Philosophy November 2009 King's College and Computer Laboratory University of Cambridge Declaration The dissertation is not substantially the same as any I have submitted for a degree or diploma or any other qualification at any other university. Further, no part of the dissertation has already been or is being concurrently submitted for any such degree, diploma or other qualification. This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where specifically indicated in the text. This dissertation does not exceed the regulation length of 60 000 words, including figures, tables and footnotes. This work was supported by EPSRC grant NRAG114 and the Cambridge Commonwealth Trust. Sriram Srinivasan, November 2009. Abstract Kilim: A Server Framework with Lightweight Actors, Isolation Types & Zero-copy Messaging Sriram Srinivasan Internet services are implemented as hierarchical aggregates of communicating components: networks of data centers, networks of clusters in a data center, connected servers in a cluster, and multiple virtual machines on a server machine, each containing several operating systems processes. This dissertation argues for extending this structure to the intra-process level, with networks of communicating actors. An actor is a single-threaded state machine with a private heap and a thread of its own. It communicates with other actors using well-defined and explicit messaging protocols. Actors must be light enough to comfortably match the inherent concurrency in the problem space, and to exploit all available parallelism. Our aims are two-fold: (a) treat SMP systems as they really are: distributed systems with eventual consistency, and (b) recognize from the outset that a server is always part of a larger collection of communicating components, thus eliminating the mindset mismatch between concurrent programming and distributed programming. Although the actor paradigm is by no means new, our design points are informed by drawing parallels between the macro and micro levels. As with components in a distributed system, we expect that actors must be isolatable in a number of ways: memory isolation, fault isolation, upgrade isolation, and execution isolation. The application should be able to have a say in actor placement and scheduling, and actors must be easily monitorable. Our primary contribution is in showing that these requirements can be satisfied in a language and environment such as Java, without changes to the source language or to the virtual machine, and without leaving much of the idiomatic ambit of Java, with its mindset of pointers and mutable state. In other words, one does not have to move to a concurrency-oriented language or to an entirely immutable object paradigm. We demonstrate an open-source toolkit called Kilim that provides (a) ultra-lightweight actors (faster and lighter than extant environments such as Erlang), (b) a type system that guarantees memory isolation between threads by separating internal objects from exportable messages and by enforcing ownership and structural constraints on the latter (linearity and tree-structure, respectively) and, (c) a library with I/O support and customizable synchronization constructs and schedulers. We show that this solution is simpler to program than extant solutions, yet statically guaranteed to be free of low-level data races. It is also faster, more scalable and more steady (in increasing scale) in two industrial strength evaluations, interactive web services (comparing Kilim Web Server to Jetty) and databases (comparing Berkeley DB to a Kilim variant of it). Acknowledgments To Alka, my dearest. Life’s a wonderful journey, and I have the best travel companion. To Amma and Appa, for the easy laughter, love and encouragement, and the early mornings and late nights they have lavished on me. This dissertation is dedicated to Amma. She would have exclaimed “hai”! To my parents-in-law, for being an additional set of parents. One should be so lucky. To my advisors. Prof. Jean Bacon, for making this excursion possible, for allowing me to gradually discover my interests and strengths without pressure, and for her gentle and selfless guidance to collaborate with other groups. To Prof. Alan Mycroft, for setting distant Bézier control points that so significantly influenced the arc of my learning, for his exhortations and for his unwavering belief that I would get there. To Dr. Ken Moody, for his critical reading of my papers and this document, for his cheerful and enthusiastic cross-pollination of ideas, and tours of beers and wines. To Reto Kramer, for his friendship, guidance, technical knowledge, and a brain the size of a planet. There are a few dissertations lurking in there. To Lauri Pesonen and David Pearce, for the critical reviews, for the long coffee sessions packed with techie talk, and for sharing my enthusiasm for programming languages. To Eiko Yoneki, Pedro Brandão, David Eyers, Dave Evans, Minor Gordon and Niko Matsakis for sharing their impressive knowledge, opinions and good humor, for challenging my biases and assumptions, and for the gentle prodding to get on with it. To Boris Feigin and Kathryn Gray for introducing me to the world of types and semantics, and holding me accountable to rigor. To seminal influences in my professional life: Ed Felt, Anno Langen, Uday Rangaswamy, Gopalan Sankaran, Jolly Chen, Paul Ambrose and Terence Parr. I am a professional learner because of them. To Tony Hoare, Simon Peyton Jones, Mooly Sagiv and Uday Khedker for giving me their valuable time and for just the right mix of hard questions and encouragement. To Tim Harris, Martin Richards, Cliff Click, Chris Newcombe and the Berkeley DB team for lending their ear, time and patience. To Samuel Kounev, Anil Madhavapeddy, Sandy & Elisabeth Fraser for their viral enthusiasm and constructive encouragement. I hope to pass that on. To various cafes in Berkeley and Cambridge: the Beanery, Roma, Nero and King’s, where I was both pupa and pupil! And finally, in a dissertation that speaks of networks, load-sharing and fault-tolerance, it seems only fitting to acknowledge my own wonderful network, one that shares my load and tolerates my foibles: my sister Vidya and little Malavika for being a rock of support; Praveen & Shaila and the cuties, Anjali and Sanjana; Lizette, for the food, coffee, the Wire and the good cheer, and to all our dear friends: Ajit, Alka (Raghuram), Amita, Anil, Arjuni, Ashwini, Aslam, Athula, Banny, Caroline, Conor, Deepak, Flavia, Geetanjali, Hanna, Haresh, Hiroo, Jennifer, Kavita, Kinkini, Kirtana, Krishna, Larry, Lata, Lise, Maithili, Mark, Mayank, Namrata, Naren, Nasreen, Nishanth, Niteen, Nupur, Peggy, Prasad, Raghu, Raman, Rao, Sanjay, Santosh Jain, Sarita, Satish, Sarah, Savita, Scarlet, Shaheen, Shobana, Simona, Tara, Tom, Trish, Usha Jain, Uma, Vandana, Viji, Vinod, Vipul. Thank you all. Contents 1 2 3 4 Introduction 13 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.3 Kilim: Communicating Actors for Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.5 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Background 23 2.1 Present-day server architecture issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.1.1 Data plane: State transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.1.2 Data plane: Immutable objects ............................... 29 2.1.3 Control plane .......................................... 29 2.2 Hardware trends and server-side design opportunities . . . . . . . . . . . . . . . . . . . . . . 33 2.3 Communicating Actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.4.1 Control Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.4.2 Data Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.4.3 Isolation: not an isolated quest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Kilim Abstract Machine 41 3.1 Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 Operational Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Type System 49 4.1 Design Choices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.1.1 Putting them all together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Phase-1: Base Type Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2.1 Phase-1 Termination and Soundness: Proof sketch . . . . . . . . . . . . . . . . . . . 55 Typestates and Heap Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.3.1 56 4.2 4.3 Object typestate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 CONTENTS 4.3.2 Shape graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.3.3 States + Shape graph = Shape-state graph . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.3.4 Properties of Shape-State graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Phase-2: Message verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4.1 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4.2 Semantic Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.5 Soundness of message verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.6 Mapping to Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.6.1 Sharable types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.6.2 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.6.3 Collections and Iterators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4 4.7 5 Kilim architecture and implementation 83 5.1 Actors and Fibers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.1.1 Continuations: An overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.1.2 Automatic Stack Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.1.3 JVM constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.1.4 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.1.5 Cooperative Preemption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.1.6 Lightweight Threads: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.2 Communication and Coordination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.3 Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.3.1 Scheduler Hopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.3.2 Scheduling: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Network Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.4.1 The Kilim I/O architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.4.2 Network Support: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.4 6 Evaluation and Performance 99 6.1 In-core benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.2 Network benchmark: Kilim I/O vs. Thread-per-request . . . . . . . . . . . . . . . . . . . . 100 6.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 CONTENTS 6.3 6.4 6.5 7 CPU + Network benchmark: Kilim Web Server . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.3.1 KWS architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.3.2 KWS application: auto-completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 CPU + Disk: Berkeley DB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.4.1 Berkeley DB: original architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.4.2 Berkeley DB: Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.4.3 Berkeley DB: Benchmark and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Qualitative Evaluation: Kilim in practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Conclusions and Future Work 7.1 8 11 111 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Notation and Nomenclature Bibliography 115 117 Introduction 1.1 1 Overview This dissertation is about the structure of highly concurrent internet servers such as web, email, game, chat, name and app-servers, as well as storage managers and load balancers. An industrial-strength server is expected to deal with a large number of concurrent requests, the number being much greater than the number of available processors. Each request results in a splitphase workflow combining computation, network and disk activity. The server must cater to varying traffic patterns and multiple request classes each with its own service-level guarantee (often specified in terms of 99th percentile latency and latency jitter). It must support dynamic component upgrades (plugins, servlets, spam filters) and must be engineered with fault- and performance-isolation aspects to help remote administration, online diagnosis and rapid recovery from component failures. A single server is rarely deployed all by itself, for load-sharing and fault-tolerance reasons. Larger scales promote increasing levels of redundancy and backup, to the point that the failure of an entire data center or network can be tolerated. In other words, zooming outwards on a deployment setup, we find that the demands on a single server, listed above, are merely a miniature version of the demands placed on a cluster, which in turn resemble in miniature form those placed on the data center. If the requirements are similar, we ask whether the architecture of a data center or cluster has any relevance to the way an individual server is structured. As Fig. 1.1 shows, a data center is connected to other data centers and contains clusters offering specialized services (data storage, application modules, traffic routing and so on). Clusters are networked with other clusters and are built with individual server machines (typically SMP), which in turn contain independent but connected components (operating system kernel and application processes). The following phrases are key to this line of enquiry: active components (not always reactive), asynchronous messaging, isolation, scheduling (timing of deployment) and location affinity (assignment of application layers to clusters and of servers to machines). Isolation takes myriad forms: executionisolation, memory-isolation, fault-isolation and upgrade-isolation. These architectural aspects apply at all levels. Our thesis is that this architecture should continue on to the intra-process level as well: Servers are best built using “actors”, components with private state and a private thread of execution, that communicate with each other using well-defined asynchronous messages and protocols. Actors must be isolated, monitorable, application-schedulable, and lightweight enough to exceed the number of concurrent user-level activities. 14 1.1. OVERVIEW Data center Cluster Cluster Machine To other data centers OS Kernel OS process Passive objects in shared memory Current architecture Proposed architecture Figure 1.1: Hierarchical, nested networks in a data center. We propose extending this structure to the intra-process level. Each of these characteristics sets the communicating actors proposal apart from current mainstream approaches. All servers are currently built upon manipulating passive objects in shared memory, regardless of the control plane (kernel threads, user-level threads, event-loops, coroutines51 ). None of these approaches (object-orientation included) provides isolation in the myriad forms listed earlier. Edward Lee117 puts it succinctly: “things happen to objects, actors make things happen”. Further, none of the extant control plane schemes combine simplicity of development with adequate applicationlevel control over failure-handling, placement (processor affinity), and scheduling of activities. We will elaborate on these issues in the next chapter. Déjà Vu? The communicating state machine model is by no means new or rare. It has been widely adopted at every scale: (a) modeling approaches (Tony Hoare’s CSP,91 Robin Milner’s π-calculus124 and Carl Hewitt’s Actors87 ), (b) model checkers like SPIN,94 (c) at the implementation level, seen in isolated operating system processes communicating with others via IPC, to the kernel via syscalls, and to the graphics processor using OpenGL pipelines etc.). And of course, replicated state machines are necessary at the distributed system level.159 In many of these cases, the underlying hardware (CPU, memory, network) is shared, but isolation is maintained either physically (via processor-supported address space partitioning), or logically (for example, via linear type systems or copying data to avoid data races). At the intra-process level, we have seen generations of concurrent languages that provide isolation CHAPTER 1. INTRODUCTION 15 and fine-grained schedulable tasks. Notable examples are Per Brinch Hansen’s Concurrent Pascal,83 MESA,111 occam-pi,134 Ada,38 Concurrent C,71 Limbo,188 Erlang13 and Sing# ,60 just to name a few. None of these languages have been widely adopted (yet) in building highly concurrent servers∗ . Why? One or more of the following (technical) reasons may be relevant. First is the hegemony of the Algol-family of languages; one cannot discount the tool support and extant training in imperative languages with pointers and mutable state. This model is deeply entrenched, as any book on basic data structures can attest. Second, commodity processor architectures such as Intel and ARM have co-evolved with these languages, which means certain common patterns have found expression in silicon. For example, the call/return mechanism and stack allocation are cheap due to the presence of stack registers, register windows and stack segments. Shared memory constructs are fast due to built in CAS (compare and swap) and cache coherence machinery. In turn, C has become the official “systems” programming language. If Lisp or Lisp machines had had their way, we might possibly be able to count on hardware support for garbage collection, tail recursive calls and closures. The third reason is performance. In server-side environments, high performance always holds sway over cleanliness of a model, even consistency. For example, most modern large-scale services do away with database joins and data normalization, in the pursuit of raw speed. Trading availability for consistency, ACID (atomicity, consistency, isolation, durability) guarantees have given way to BASE54, 146 (Basically Available, Soft state, Eventually consistent). Finally, not all approaches provide configurable control over scheduling and processor assignment. Which brings us to the topic of how our work differentiates itself from existing work. 1.2 Thesis Contributions Our main contribution is to bring the communicating actor model to the world of mainstream imperative languages, pointers and mutable state, subject to the requirements and rigors of server-side environments. We address the problem of memory isolation by taking a novel approach: we treat messages as a distinct category of objects, the only kind that can be exchanged between actors. All other objects are confined to the actor in whose private heap they are created. Messages are mutable, but subject to export restrictions: (a) they are tree-structured and, (b) linearly owned — readable and writable, indeed visible, by only a single actor at a time. We present a type system (Chapter 4) that statically differentiates between ordinary internal objects and exportable messages, and further enforces the message constraints above. By controlling ownership and reining in arbitrary pointer aliasing for messages, the type system statically guarantees absence of low-level data races. The only way to affect an actor’s state is by sending it a message, and that too only after the actor explicitly imports the message into its context and chooses to change its state. This relieves the programmer from having to worry about unexpected changes to an actor’s state or the effect of memory models. At the same time, it permits messages to be unencapsulated (public, non-final ∗ Erlang is seeing an encouraging trend at the time of this writing. 16 1.2. THESIS CONTRIBUTIONS fields), and to be passed by reference (zero-copy messaging). The next section describes these aspects in some detail. The second contribution of our approach is to show that it is possible to retrofit a high-level language such as Java with actors light enough to be in the hundreds of thousands, faster to contextswitch∗ , preemptible (to a degree) and more scalable than concurrent languages such as Erlang that have low-level support for stack switching. This, in spite of the lack of access to the stack in the JVM† , leave alone modify or switch stack contexts. We present a framework called Kilim103‡ that combines a code generator, an actor framework, the type system and a small run-time library. We aim to show that with just a small number of internally consistent design principles and techniques it is possible to have speed, statically guaranteed safety, reliability and programming simplicity, and to continue to use existing tools (debuggers, profilers, IDEs) and libraries without too much change. The final contribution — most crucial for a server-side programmer — is to show vastly improved performance metrics. We compare industrial-strength products, noted for their performance in their respective categories, to their Kilim equivalents: Jetty for dynamic web services, and Berkeley DB for transactional data stores. We demonstrate lower latency and jitter, much better scalability and fairness. Nomenclature We appropriate the term actor to represent a state machine with a control stack of its own (a private thread) and an isolated heap. Our usage of this word differs from the original definition in Carl Hewitt’s87 and Gul Agha’s3 pioneering work on the actor model. There, an actor is an active object (with its own thread) with a singular identity to which messages can be addressed. We are agnostic as to whether an actor has one identity or has multiple facets. Our usage bears closer resemblance to processes used in process-calculi,124 but we reserve the term process for operating system processes in this document. The terms thread and concurrency are often used to collectively refer to a slew of facilities: automatic stack management provided by the operating system, preemptive context switching, and coordination using shared memory and locks§ . We will use the term multi-threaded programming to refer to this context, but the term thread will be used to refer solely to the control plane (the first two aspects), separating the data plane aspects of synchronization and data exchange. In particular, we will use the term kernel thread when the thread’s stack is managed and context-switched by the operating system. Finally, the term server in this document refers to one or more kernel threads, possibly spread over multiple operating system processes, that together provide a single service. Many “event vs. thread” debates107, 181 implicitly assume a single process context (more on this in the next chapter), which need not always be the case (examples are database servers such as Postgres, and web servers such as Flash). The point is that the resolution of such debates in favor of one or the other does not say whether a single process can provide a particular service with the required guarantees (performance, fault-tolerance). ∗ With the caveat that all performance comparisons are snapshots in time A method cannot reflectively discover any details about the caller’s activation frame. ‡ A Kilim is a flexible, lightweight Turkish flat rug woven with fine threads. § For example, Edward Lee’s paper on The Problem with Threads117 uses the term this way. † CHAPTER 1. INTRODUCTION 1.3 17 Kilim: Communicating Actors for Java Our original aim was to build a communicating actors framework for low-level systems programming, with an emphasis on type-directed memory isolation, fast hardware-supported context switching, and explicit interface automata specifications in the style of Clarity44 and Melange120 . We were also interested in the idea of using the SMP cache-coherence machinery for passing messages selectively and lazily by reference, or sending short messages directly in the style of URPC.24 Finally, control over actor placement influences locality of instruction and data caches and simplifies garbage collection considerably. Building such features into a language is still the long-term aim. We investigated LLVM and the D language in some detail, but switched to Java, partly for its rapid adoption in industrial environments, its excellent and portable tool support and libraries, but mainly for its portable memory consistency model. Early prototypes also showed surprising performance numbers, a testament to the quality of the JVM. Most of the ideas listed above have been incorporated into Kilim, the current toolkit. Kilim is a combination of a bytecode post-processor called the weaver, a type-checker, and a small, extensible run-time library (67k). We present the highlights below, and devote the rest of the dissertation to explaining each of these points in some detail. Ultra-lightweight threads Java threads are expensive in terms of creation time, context-switching speeds and space. Kilim’s weaver provides cooperative, ultra-lightweight threads by transforming Java bytecode using a variant of continuation passing style. It requires that the programmer specially identify methods that may need to suspend mid-way, only to resume later based on some condition. Such methods are called pausable, and marked by the annotation @pausable (see example in §1.4). The annotation is similar in spirit to checked exceptions in that all callers and overriding methods must be marked @pausable as well. The weaver injects code in all pausable methods to provide automatic stack management.2, 67 The engineering details of this transformation are explained in Chapter 5. Kilim’s actor threads are quick to context-switch and do not need pre-allocated private heaps. Kilim permits multiple user-level schedulers (including programmer-supplied ones) to partition kernel threads and processors, and for actors to migrate to different schedulers; this gives flexible control over placement and scheduling. Communication via mailboxes Kilim contains a run-time library of type-parametrized mailboxes for asynchronous message-passing and a prioritized select operator to choose from a number of ready mailboxes (similar to CSP’s alt91 ). Mailboxes have optional dynamically settable bounds to prevent producers from overrunning consumers. In the typical server scenario where a user request proceeds from one stage of processing to another, blocking an actor puts a back pressure on the earlier stages, all the way back to a specific socket connection. This is a fine-grained version of SEDA-style I/O conditioning.186 Mailbox references are first-class values that can be incorporated into messages, as in π-calculus.124 These features permit a dynamically evolving topology of interconnected components. They separate the initiator of a request from the thread that reaps the result, useful for example in RPC chains.162 They also allow deferred synchronous approaches where multiple requests may be initiated before 18 1.3. KILIM: COMMUNICATING ACTORS FOR JAVA waiting for a response. Mailbox references are intended to be shareable across address spaces (§7.1), but the current implementation is an intra-process solution only. Messages as a special category Traditionally, most messaging solutions treat messages specially. Java, for example, requires any object that can be sent as a value between address spaces to be marked as Serializable. However, serialization is an expensive operation even in a network scenario, and prohibitively so in an intraprocess setting. For this reason, we are interested in passing message objects by reference, serializing them only if they cross process boundaries. But passing objects by reference between actors requires us to confront the problems of pointer aliasing. We will defer to Chapter 4 to discuss these problems in detail, as well as the scores of attempts to bound the problem space. Below we provide an overview of the special treatment accorded to messages by Kilim. We treat messages as philosophically distinct from, and much simpler than, other Java objects. Messages are: • Unencapsulated values without identity (like their on-the-wire counterparts, XML, C++ structs, ML datatypes and Scala’s case classes). The public structure (instance fields are public and mutable) permits pattern-matching,123 structure transformation, delegation and flexible auditing at message exchange points; these are much harder to achieve in the presence of encapsulation. The public structure also results in far less syntactic noise compared to the setters and getters seen in object-oriented wrappers.74 Note that these wrappers do not really provide any encapsulation because they return pointers to internal objects and retain pointers supplied to them; their setters and getters merely add to accidental complexity. • Not internally aliased. A message object may be pointed to by at most one other message object (and then only by one field or array element of it). The resulting tree structure can be serialized and cloned efficiently, and effortlessly stored in relational, XML and JSON schemas). This structure is widely adopted by convention, almost without exception. Examples include events or messages in most server frameworks (in fact, all the ones that we are aware of), windowing systems, and the Singularity operating system.60 Enforcing this convention simplifies the type system enormously. This style of simplification is philosophically similar to the SPARK20 subset of Ada, in that it is sometimes worth limiting the full generality of the language in exchange for guaranteed reliability. • Linearly owned. A message belongs to, and is visible to, a single actor at a time. After sending a message, the actor loses complete visibility of that message (and all reachable objects). This allows efficient zero-copy message transfer (by reference) where possible, without the danger of any mutations to that data being simultaneously visible to any other actor. The programmer has to explicitly make a copy if needed, and the imperative to avoid copies puts a noticeable “back pressure” on the programmer. Linearity is often informally specified in commercial code; figure 1.2 shows an example taken from the Apache MINA125 framework (an event-driven asynchronous API over Java’s NIO facility). Statically-enforced isolation We enforce the above properties at compile-time. Isolation is interpreted as interference-freedom, obtained by keeping the set of mutable objects reachable from an actor’s instance fields and stack CHAPTER 1. INTRODUCTION 19 class HttpSessionContext { .... /** * Writes the specified response back to the client. * The response must not be modified after it has been * submitted for comittment - irrespective of whether a commit * is successful or not. * * A request may have only one response committed to it. The return * value of this method can be used to determine whether the * supplied response will be used as the single response for this * request. *

* Application code must not modify a response in any way after it * has been committed to a request. The results of doing so are * undefined. * */ boolean commitResponse( HttpResponse response ); } Figure 1.2: Example of informally specified linearity, seen in Apache MINA documentation totally disjoint from another actor’s. Kilim’s weaver performs a static intra-procedural heap analysis that takes hints from isolation qualifiers specified on method interfaces. The uniqueness properties of the type system can be used for recycling resources such as buffers and security capabilities. If Kilim were to target a lower-level machine without garbage collection (as was the original plan), such tracking would be necessary for memory management even for immutable objects. Portability Kilim is designed to meet the requirements of production server-side systems. With a view towards immediate industrial adoption, we imposed the following additional constraints. The solution must be portable, in that it should not depend on a particular JVM or rely on facilities afforded by specific compilers (such as the pluggable type system in an upcoming version of javac140 ). For the same reason, it should not require any syntactic modifications to the Java source language, which unfortunately precludes many interesting constructs. It should remain within the idiomatic confines of Java (using exceptions, mutable state and so on), except, of course, for the concurrency constructs – those are replaced with message passing. For the reasons stated above, Kilim’s weaver is a bytecode post-processor (Fig. 1.3). It performs the type checking and CPS transformation; the transformed code runs in conjunction with the supplied run-time library. The Kilim toolkit provides mechanism, not policy, and offers largely orthogonal feature-sets. For example, if the JVM were to natively provide lightweight threads, one should be able to use the messaging mechanisms and the isolation type checking aspect without any change. The mailboxes are a library construct; unlike built-in primitives such as channels in some concurrent languages, mailboxes are not given special treatment by the Kilim weaver. This permits variants in terms of number of producers and consumers (currently mailboxes have n producers and one consumer), customized versions 20 1.4. EXAMPLE javac annotated src parse external annotations typecheck heap model bytecode isolation check CPS transform bytecode Kilim weaver Figure 1.3: javac output post-processed by Kilim weaver for primitive types, and buffering strategies (unbuffered, fixed size buffers). The scheme also permits other constructs such as futures and Haskell-style MVars to be provided at a later stage, and lets the type system guarantee safety. 1.4 Example Figure 1.4 illustrates a simple Kilim program (names that have significance to Kilim are colored gray). It sets up a chain of actors, where each actor knows about its mailbox and that of the next in the chain (and notably, is not given the other actor’s reference). The main thread pushes an empty StringBuffer into the first actor’s mailbox, which writes “hello” into the buffer and passes the modified StringBuffer on to the next actor, and so on. The last actor appends “world”, prints it out and exits. The structure of an actor is analogous to that of a Java thread: it extends the Actor class instead of Java’s Thread, and its entry point is execute() instead of run(). Note the use of the @pausable annotation on execute, because it calls the mailbox’s get() and put() methods, which may pause if the mailbox does not have messages or (respectively, is full). Note that main does not have to be declared pausable because it only calls a non-pausing method called putnb(). Chapter 5 has more details on mailboxes, actors, and support for cancellation and scheduling. The following steps illustrate the compilation, post-processing and running steps. > # > # > javac -d ./classes Chain.java Post-process using weaver java kilim.tools.Weave -d ./classes examples.Chain Run example java -cp ./classes:$CLASSPATH examples.Chain 1.5 Dissertation Outline Chapter 2 (“Background”) argues why the current paradigms of mainstream server-side design are flawed or inadequate not just for future trends, but for present day requirements and conditions as well. CHAPTER 1. INTRODUCTION 1 2 21 package examples; import kilim.*; 3 4 // A mailbox for exchanging StringBuffer objects. 5 class Smb extends Mailbox {} 6 7 8 9 public class Chain extends Actor { // instead of ‘‘extends Thread’’ Smb mymb, nextmb; public Chain(Smb mb, Smb next) {mymb = mb; nextmb = next;} 10 @pausable 11 public void execute () { // equivalent of Thread.run() while(true) { StringBuffer sb = mymb.get(); if (nextmb == null) { // last actor. System.out.print(sb + "world"); System.exit(0); } else { sb.append("hello "); nextmb.put(sb); // send to the next actor } } } 12 13 14 15 16 17 18 19 20 21 22 23 24 public static void main(String args[]) { Smb mb = new Smb(); Smb nextmb = null; // Set up the chain for (int i = 0; i < 1000; i++) { new Chain(mb, nextmb).start(); nextmb = mb; mb = new Smb(); } // Send a message to the first element in the chain. nextmb.putnb(new StringBuffer()); } 25 26 27 28 29 30 31 32 33 34 35 36 37 } Figure 1.4: Sample Kilim program: Each actor in a chain of actors appends “hello” to a mutable buffer and passes it on. The last actor prints the buffer. Chapter 3 (“Kilim Abstract Machine”) is a precursor to the chapter that follows. It presents the operational semantics for an abstract machine that is similar to the JVM, but with support for memoryisolated actors and message passing. Messages are physically moved from one actor’s state to another. Chapter 4 (“Type System”) presents the isolation type system and a proof of soundness. It restricts the set of allowable KAM (Kilim Abstract Machine) programs to ones where a dangling reference is never used. This means that a Kilim program deemed type-safe by the weaver is guaranteed not to access any part of a message that has been sent, thereby ensuring an absence of data races at compiletime. Chapter 5 (“Kilim architecture and implementation”) describes the bytecode transformation to achieve lightweight actors, and a discussion of the design points employed in the run-time library. Many of these issues are pertinent to other mainstream imperative languages as well. 22 1.5. DISSERTATION OUTLINE Chapter 6 (“Evaluation and Performance”) demonstrates micro-benchmarks (comparisons to Erlang on tasking and message passing performance) and two macro-benchmarks: a CPU+network comparison pitting the Kilim Web Server against Jetty (a popular industrial-strength web server), both of which are based on the same underlying Java NIO libraries, and a CPU+disk comparison pitting Berkeley DB (a widely adopted key-value store) to its Kilim variant. Background 2 We observed earlier that messaging between isolated state machines is widely adopted; we now present the case for the approach to be applied at the intra-process level as well. The structure of this chapter is as follows. We first present arguments for why present-day serverside solutions create a fair amount of accidental complexity, in the words of Fred Brooks,36 and why they are sometimes inadequate to handle the essential complexity of the problem as well. We then take a brief look at emerging trends in processor and network architectures (§2.2). We argue (§2.3) that the communicating actor model is better suited to handle present-day problems as well as likely future scenarios. Finally, we examine some notable research work related to building highly concurrent servers. 2.1 Present-day server architecture issues In this section, we study mainstream server-design approaches, and in the section on related work (§2.4), we will examine other proposed solutions that are notable, although not necessarily mainstream. At a logical level, all server-side frameworks and applications have concurrent threads of control that transform the state space in a collaborative fashion. We study how threads of control synchronize (the control plane; §2.1.3) separate from what it means to collaboratively update the system’s state space (data plane; §2.1.1). Two logically equivalent implementations may differ considerably on other axes of practical importance: the cost model, composability, fragility in the face of software evolution, run-time diagnosis and so on. 2.1.1 Data plane: State transformation In this section we focus on that part of the system state that two or more threads of control may be interested in modifying concurrently. We categorize different concurrency control schemes according to the way they segregate updates to the state space. In temporal partitioning schemes, the memory in which the updatable state resides is shared, and different threads of control take turns directly modifying the state space in-place; only one is permitted to update it at any one time. In spatial partitioning schemes, the state space is spread over partitioned memory spaces, and multiple partitions may contain data pertaining to the same logical entities. Memory can be partitioned physically (memory banks, hardware-supported process isolation) or logically partitioned (with a type system); either way, each thread of control has its own space for in-place modification. 24 2.1. PRESENT-DAY SERVER ARCHITECTURE ISSUES We explore these two categories in some detail, and argue that update isolation via spatial partitioning is preferable to temporal partitioning. Temporal partitioning via locks The primary temporal partitioning scheme uses locks to serialize updates to shared memory, and is by far the most popular of all partitioning approaches. All mainstream languages possess fine-grained memory sharing semantics. This is true even in cases where there are no pointers, such as lexically shared variables amongst closures. Instructions such as CAS (compare and swap) and LL/SC (load-linked, store-conditional) help perform atomic updates on shared memory cells and enable a variety of synchronization constructs such as semaphores, monitors, locks, mutexes and condition variables. The provision of fine-grained data updates (different threads can perform updates of a single memory word) and fine-grained locking (each Java object has a private lock) can make for a highly concurrent system, but as experience has shown,26, 117, 173 it is also extremely error-prone and hard to reason about. Here are some prominent issues with fine-grained locking and sharing: Locks and multiple objects. An object does not get to decide its atomicity context, the application does. While monitors are useful for making consistent changes to a single object (a bank account, for example), there is the need to atomically effect changes to multiple objects; for example, a bank transfer requires that changes to two accounts is made atomically. The programmer must make arrangements to safely update a set of objects without the language’s help, which is a hard problem when different threads of control are updating different but non-disjoint sets of objects. This is closely related to the next point. On-demand lock acquisition. Since locks are acquired on demand as a thread proceeds, multiple threads may acquire different sets of locks in different orders depending on their control flow. This is likely to cause deadlocks (see §2.1.3). Frequently, a lock implicitly grants permission to the lock owner to update a given set of objects, but errors creep in when the same set of objects can be reached via another traversal that does not pick up the required locks along the way (more on this in the next section). Separating locking and unlocking. If one procedure locks an object and another unlocks it, and if the procedures can be called in any order and any number of times, it is practically impossible to reason about that object’s consistency. This is a problem even if the locks are reentrant. Monitors and Java’s synchronized blocks couple locks and unlocks with lexical and dynamic scope, which is a good thing, but there are other problems with it, as we shall see in the control plane section. Locking is expensive. Applications are sensitive both to the overhead of locking and to the time taken within critical sections. Lock contention can be extremely expensive for a waiting process, especially if it is forced to suspend until the lock is available. Much literature on concurrency is focussed on highly concurrent data structures. The problem does not lie here; after all, data structures offer bounded and predictable time windows of contention, and are available in the form of libraries written by expert programmers. The problem lies in using the lock paradigm in a more general setting (in the application area, and by less trained programmers); in one instance (from personal experience), a single log write inserted within a lock window produced bizarre timing behavior, depending on the information contained in the log data. Locking is optional Mainstream languages do not record (or require the programmer to declare) whether an object can be accessed by multiple threads concurrently, or require that a sharable object be CHAPTER 2. BACKGROUND 25 always accessed under a lock. Errors due to lock omission are highly non-deterministic; they may not even show up as a problem until run on a different processor configuration. As Per Brinch Hansen84 points out, the term monitor in the Java and .NET environments is used erroneously to refer to objects with built-in locks, since a monitor by definition must always hide state and protect it using a built-in mutex. Most concurrency errors can be traced to the fact that these are optional facilities. Note that locking is a logical synchronization mechanism that acquires a permission before executing a corresponding action. As such, locks can be used in a message-passing system to regulate access to a shared actor, and one should expect to see the problems above in such a situation as well. However, the scale of such problems is considerably less than the usage of locks in mainstream languages due to the fact that the more fine the grain of sharing, the more are the chances that a given data structure has multiple access and update paths through its constituent objects. Coupled with the undecidability of precise static pointer alias detection,148 it is hard to guarantee absence of low-level data races. In the section on related work, we will briefly cover lock-free data structures, another way to synchronize access to shared data without indefinitely blocking the threads. Temporal partitioning via scheduling The second way of ensuring that threads take their turns updating a portion of shared memory is by scheduling them appropriately. Scheduling can be implicit in the way the code is arranged to execute. The event-oriented style of code, essentially a single thread with a main loop, and seen in many (perhaps most) server and GUI frameworks, is a classic example of an implicit style of scheduling control. This pattern provides “run-to-completion” semantics, because the thread of control is not interrupted preemptively, which in turn ensures that state updates can be made consistently without using locks. These properties are often presented as the reason for the simplicity and performance of an event-driven style over a multi-threaded architecture. While the event-driven style is the mainstay of most C-based non-threaded applications, it uses only a single processor. All event-vs-thread debates107, 137, 181 implicitly assume a uniprocessor context for comparing single-threaded event loops to a multi-threaded programming style. Another problem is that changes to the state machine tend to be brittle, because splitting up a relatively long-running computation into multiple intermediate states may invalidate the earlier assumption that the shared data structure is in a consistent state every time the flow of control returns to the main loop driver. Temporal partitioning via explicit scheduling, while rare in typical server applications, is used widely in real-time and parallel programming environments. For example, the OpenMP library provides compiler pragmas to help the programmer annotate which sections and which loops can be evaluated in parallel. These schemes tend to be brittle or non-portable as they tend to hard-code assumptions about the hardware. Temporal partitioning, shared objects and relaxed memory consistency All temporal partitioning approaches rely on updating shared objects in place; we now examine what it takes to share memory between processors. As processor speeds have exponentially outstripped memory speeds, a wide variety of tricks have been employed to blunt the impedance mismatch. Several levels of caches have been introduced, memory updates are deferred by write-behind caches (in the hope of batching writes), instructions are routinely eliminated by the compiler, or reordered due to compiler optimizations and instruction level 26 2.1. PRESENT-DAY SERVER ARCHITECTURE ISSUES 1 2 class Foo { private Bar bar = null; 3 4 5 6 7 8 9 10 11 1 2 3 synchronized public Bar get() { if (bar == null) { bar = new Bar(); } return bar; } } 4 5 6 7 8 9 10 11 class Foo { private Bar bar = null; public Bar get() { if (bar == null) { synchronized(this) { if (bar == null) bar = new Bar(); } } return bar; } Figure 2.1: Double-checked locking error. The code on the right is incorrect. parallelism. On a uniprocessor, these tricks vastly improve performance, yet provide an illusion of a single memory, and behave as if the instructions and data were evaluated in program order (as-if-serial semantics). However, given multiple processors, it is prohibitively expensive to provide sequentially consistent110 cache coherence. Consequently, every multiprocessor system can at most provide some form of relaxed consistency, which means that at any given point in time, different threads may have different values of a particular memory address (in registers or in the cache), and thereby for a particular language-level variable. Java programmers discovered147 the pitfalls of relaxed consistency the hard way when the commoditization of multi-core processors began to smoke out bugs not just in well-tested software, but also in in well-worn idioms. Subsequently Java,100, 121 the .NET platform and lately C++ 27 have elucidated portable memory model semantics that recognize the reality of relaxed consistency, without sacrificing well-known optimizations and reordering mechanisms (both in the compiler and the processor). They do this by identifying special operations such as constructor calls, synchronized blocks, access of volatile variables etc. to serve as checkpoints, where writes to memory before a checkpoint is guaranteed to be visible to other processors after the checkpoint. The compiler is not permitted to reorder code across these checkpoints. The key here is that the program must be “correctly synchronized” to prevent low-level data races, but there is no automatic way to detect if a program is incorrectly synchronized. The classic doublechecked locking optimization idiom158 is a textbook example of code that could be expected to work in a uniprocessor, but fails in a multiprocessor context (see Fig. 2.1). The code on the left (of Fig. 2.1) lazily initializes an instance field, under the protection of the parent object’s monitor. Since acquiring a monitor’s lock can be expensive under contention, and because the field is going to be initialized just the first time, programmers balk at having to penalize every subsequent access just to check if the field is non-null. One common pattern was the code shown on the right, where we skip the synchronization in the common case where bar is non-null. This is erroneous. It is possible that due to compiler and ILP-induced reordering, the updates to the corresponding object’s fields (the object pointed to by bar) have not yet made it into that thread’s processor. Without a synchronization barrier, there is no incentive for the thread’s cached values to be updated. Given the deeply entrenched mental model of imperative control flow, it is hard to reconcile with the fact that updates do not necessarily happen in program order, across threads. CHAPTER 2. BACKGROUND 1 2 3 4 5 6 7 8 9 10 11 class String { private int hashcode; public static int hashcode() { int h = this.hashcode; if (h == 0) { h = ... compute hash ... this.hashcode = h; } return h; } } 27 1 2 3 class String { private int hashcode; public static int hashcode() { 4 if (this.hashcode == 0) { int h = ... compute hash ... this.hashcode = h; } return this.hashcode; 5 6 7 8 9 } 10 11 } Figure 2.2: Memoization and benign data race. The code on the right is incorrect 1 2 3 4 5 6 7 8 9 10 11 class String { private int hashcode; public static int hashcode() { int h = this.hashcode; if (this.hashcode == 0) { h = ... compute hash ... this.hashcode = h; } return h; } } Figure 2.3: Legal (but unsafe) rewrite of Fig 2.2(right) The next example (Fig. 2.2) is an even more subtle illustration∗ of the perils of relaxed consistency. It considers two ways of lazily computing the hashcode for an immutable String object. Unlike the previous example, this example updates just a single int field. The code on the left (of Fig. 2.2, as seen in the JDK source ( Java Development Kit)) exemplifies a benign data race; in the worst case, one expects that two threads will find the hash code to be 0, race to compute it and to overwrite the hashcode field to the same value. The field being a 32-bit integer, this update is guaranteed by Java to be atomic. What could go wrong? It appears that the example on the right is equivalent (and has the same benign data race), but even expert programmers have trouble discerning a possible (though highly improbable) data race in the code on the right. In the absence of any synchronization checkpoints, the compiler is free to rewrite the code on the right to that shown in Fig.2.3. This transformation, while correct in an uniprocessor setting, introduces two reads of hashcode (Fig.2.3, lines 4 and 5). If this.hashcode was zero initially, but updated by another thread between those two lines, the method will skip the if -block and return the cached value of h (zero). Even with a defined memory model, this is not the level of reasoning or attention one can expect from even seasoned programmers. We end this section on memory consistency models with the disquieting note that processor memory models are specified informally,131 and that this is likely to be a rich source of bugs in the near ∗ Thanks to Jeremy Manson for this example. 28 2.1. PRESENT-DAY SERVER ARCHITECTURE ISSUES future. One needs constructs (such as messaging) that remove all memory consistency concerns out of the hands of the larger programming community, by concentrating them in a few maintainable concurrency constructs. False sharing Cache-coherent shared memory can also result in false sharing. Caches invalidate an entire line (typically 32 to 512 bytes) even if just one word was updated. If the code were to have an array of data items, and say thread 0 modifies item 0, thread 1 modifies item 1 and so on, we would experience cache thrash although the updates are logically isolated. Such patterns are seen in arrays of locks, arrays of shared counters (for statistical purposes), and lock-free versions of concurrent array-based queues. Performance problems due to cache line ping pongs are exceedingly hard to detect by a visual examination of the code, and programmers have to pad structures to cache line widths to ensure that isolated units of data are mapped to separate lines. We can safely conclude that programming with a combination of fine-grained access to shared memory, in-place updates and locks is not simple in a multi-processor environment. Data plane: Spatial Partitioning of updatable state We now turn our attention to spatial partitioning schemes that physically or logically partition, or spread, the state space among multiple threads of control. Either way, the threads of control do not share memory directly, but work on their own copies of the state. At some point, the threads of control resolve their differences and come to a shared understanding — consensus — on what constitutes the latest consistent overall state. The important characteristics of this approach are: memory is not shared, state is replicated, and consistency is lazily achieved (as opposed to the locking approach, where the permission is acquired before the update is made). It is also known as an optimistic approach, in that a thread makes a change to a private copy assuming that no other thread is likely to touch that logical bit of data, but is prepared to redo or abort the computation if there is consensus that that bit of data has indeed been updated. Spatial partitioning is of course necessary at the distributed system level, but has proven to be a sound engineering approach even on multiprocessor machines, although it is possible to address all of system memory and share objects directly. Many prominent database and web servers use the multiprocess approach to take advantage of hardware-supported memory isolation. Our preference for spatial partitioning stems from the observation that the underlying hardware abstraction necessarily exposes itself as a distributed system (relaxed memory consistency) and that the expectation of lazy conflict resolution is in tune with that of the underlying hardware. The biggest advantage of spatial partitioning of memory is that every bit of code has the guarantee that all variables in scope, and all objects reachable from those variables, belong to a single updater. One does not have to worry about objects getting modified unexpectedly, nor about memory consistency (as long as the scheduler issues a memory fence before switching the thread of control over to another processor). This permits compilers and processors to preserve all the optimizations made for the uniprocessor era, and for us to continue with the familiar program-order semantics. Objects can be traversed and mutated at a fine-grained level, and reachability defines the logical extent of a thread’s heap. The problem is that pointers and global variables make it impossible to statically verify confinement. Chapter 4 discusses the Kilim type system and other related work on this issue. CHAPTER 2. BACKGROUND 29 The disadvantage of the spatial partitioning approach is that multiple threads working away on their own copies (until reconciliation) might involve duplication of data and effort. This turns out to be not much of an issue in a thread-per-user or thread-per-connection architecture in servers; except for games, most applications do not involve interactions between users. 2.1.2 Data plane: Immutable objects Another common approach is to build systems with aggregates of immutable values. Reducing the amount of concurrently mutable data in the system is always sound from the concurrency point of view. Threads can share immutable objects by reference. A transition to a new state is achieved by atomically switching to another collection, mutating the aggregate in place, or using a persistent data structure. There are a few issues, however. If the aggregates are mutable in-place (at some level, one has to reflect the fact that the logical state space is mutable), we still have the problem of making consistent changes to these aggregates, which reduces to one of the two partitioning schemes above. Still, the fact that there are fewer shared mutable objects to worry about is a good thing. Next, the immutability property must be deep. An object is immutable only if all reachable objects are immutable too; however, this definition indicates that one cannot have an immutable collection of mutable objects. In an essentially-mutable language such as Java where there is no way of identifying or tagging a group of objects, there is no way to distinguish one reachable object from another (a cons cell versus a contained element). In §4.7, we will briefly examine ownership types that provide a mechanism to tag objects in a parametrized fashion. The third pitfall is that for items that are big and likely to change often (protocol buffers, for example), it does not always pay to make them immutable, because any change results in deep copies (which is why we categorize this pattern of programming as a spatial partitioning scheme). Finally, many immutable objects feature memoization. As we saw in the discussion of Fig. 2.2, observational immutability has its weak spots. Boyland33 catalogues other issues related to adding a readonly modifier to Java. 2.1.3 Control plane Most modern APIs — language-level interfaces as well as service-level interfaces provided by web services, file systems, databases etc. — are implemented as function calls. This permits a given task to be handled in a structured, hierarchically decomposable fashion, and the code to be written in a sequential and imperative style. The thread, which encapsulates the program counter, the call stack and local variables is an integral part of the state machine, because it implicitly records the work that has already been performed. One convenience of using threads to encapsulate the callstack — and arguably the source of many problems as well —- is that a thread-local context is available to store objects such as transaction context, security context and profiling information which are of interest to all methods on the callstack, without having to thread them through method parameters. There are some critical problems with this service-provided API approach, however. 30 2.1. PRESENT-DAY SERVER ARCHITECTURE ISSUES Control plane: Client-side issues The fundamental problem with functions as the means of requesting service is that functions are expected to run to completion, which means that the caller is blocked, prevented from handling any further events or issuing any other requests until the callee finishes. This is an issue in most modern GUI frameworks and up-calls in network protocols today. GUIs have been known to freeze when a remote file system connection is broken, for example. It is also impossible to add cancellation semantics to long-running computations that may block, because there is no opportunity for a blocked client to notice or act upon a cancellation notification. The underlying issue is that the problem context is inherently concurrent, but the code is forced into a single serial execution. Such a serialization also prevents use of other processors even when inherently parallel solutions exist. The following example illustrates all these issues. The Java Transactions API (a clone of the X/Open XA API) provides a transactional coordinator interface to be used by clients, and a separate resource interface implemented by databases and queue managers. Multiple resources are coordinated by the two-phase transaction coordinator. Now, consider a client that begins a transaction and involves two databases in that transaction before committing it. tx = Transaction.begin(); db1.dostuff(); db2.dostuff(); tx.commit(); // // // // associate transaction tx with the current threads db1 and db2 implicitly register themselves ... ... with tx from the thread-local context launch two-phase commit of db1 and db2 The transaction coordinator’s commit method can be (naively) implemented in a linear fashion in the following way. Note that the prepare, commit and rollback methods are all part of a standard resource interface. class Coordinator { void commit() { // Prepare-phase for (Resource r: getRegisteredResources()) { if (!r.prepare()) { this.rollback() return; } logPrepare(); // record commit decision to disk-based transaction log // Commit phase for (Resource r: getRegisteredResources()) r.commit() } } The synchronous request/response nature of the standard interfaces permits the code to faithfully follow the two-phase protocol. The problem, one among many, is that the client is blocked until commit is done. Neither the client nor the transaction coordinator can react to a timeout or cancellation once the commit process is on its way. Also, once the first prepare phase is over and the transaction is committed to disk (logPrepare), the rest can be done offline without stalling the client; this is not the case here. The resources ought to be preparable in parallel, but are forced to do so serially due to the blocking nature of prepare. Such problems are omnipresent, from system-call interfaces to RPC in enterprise-frameworks such as Java EE (Enterprise Edition), CORBA, .NET and so on. CHAPTER 2. BACKGROUND 31 What if the service were to offer completely asynchronous interfaces? For example Some network and file-systems, for examples, offer async or non-blocking interfaces (such as POSIX AIO), where the issuing of the request is separated from handling of the response. We believe that such a separation merely shifts the problem to the latter aspect. If the response is delivered as a callback, then the caller cannot easily deliver a response to its caller as a result value; it too is forced to defer its response in a separate callback. This is known as an inversion of control; instead of the code driving the control flow in a sequential, imperative fashion, one programs in a reactive style∗ . Facilities such as anonymous functions, block closures (in Smalltalk, Ruby and Scala156 ), or even Java’s ersatz version in the form of inner classes avoid the problems faced by C, in that they permit response-handling code to be placed lexically close to the request-initiating code.81 All these closure and callback-based solutions however have the reverse complication: they make it more complicated for the client to block. Further, it is easy to lose sight of the fact that the callback could be invoked in a different kernel thread; care must be taken to protect shared lexically scoped data, and barrier synchronization constructs should be used to keep in lock-step. The third alternative to completely synchronous or completely asynchronous interfaces is a deferred synchronous approach; instead of passing in a callback function or a closure as a parameter to a request, the invocation immediately returns a promise or future object, allowing the client to proceed asynchronously while the request is being processed. The client can pass this object around as a firstclass primitive, in some cases even to other processes (incidentally, a potentially non-scalable solution because it requires distributed garbage collection). The client can either poll the future object or wait on it until a response has been generated; either way, it blocks its own caller. To avoid this, it needs to return a future of its own which depends on the earlier future, and so on up the call stack. Such composition of futures need first-class support for higher-order functions to be achieved elegantly. The advantage of the deferred synchronous approach is that most implementations permit waiting on multiple future objects simultaneously, which allows the caller to initiate many requests and wait at one rendezvous spot for all responses. The other problem with blocking the client (other than being unable to respond to any other event) is that in all mainstream approaches, the callstacks are mapped directly to kernel threads, which are expensive to create and to context-switch. What if threads were so lightweight that blocking them is a non-issue? All concurrency-oriented languages have constructs such as spawn, cobegin or par to spin off additional user-level threads. Our framework, Kilim provides such a facility as well: a Java program can quickly spawn hundreds of thousands of ultra-lightweight user-level threads (Chapter 6 quantifies the terms lightweight and heavyweight). Kilim’s mailboxes, described later, provide deferred synchronous behavior and multiplexed waiting on multiple mailboxes. Note that it is still possible for the code to block a kernel thread by making a blocking system call. There are three options: (i) Use a facility such as scheduler activations9 where the kernel scheduler coordinates with a user-level thread scheduler to dispatch I/O notifications (ii) tolerate the call, in the hope that the blockage will not last for too long (iii) provide an adapter that blocks the user-level thread, but internally issues asynchronous calls to the kernel. Since the first option is not widely available, our approach is to rely on the latter two; we will elaborate our design in Chapter 6. In conclusion, the ideal control plane on the client side involves a combination of asynchronous messages and blocking calls in conjunction with lightweight threads and deferred synchronous constructs. For software evolution, one needs to be able to change the client from a synchronous to ∗ Sometimes called the Hollywood principle: “Don’t call us, we’ll call you”. 32 2.1. PRESENT-DAY SERVER ARCHITECTURE ISSUES deferred synchronous interface without modifying the interface. Note that asynchronous messages can be point-to-point (oneway) or broadcast (publish-subscribe mechanisms). In this thesis and in the Kilim framework, we concentrate on oneway messaging only. Control plane: Service-side issues We now survey the problems introduced by the function-oriented API approach from the service provider’s point of view. Services and objects react to events and do not have the luxury of driving the control flow imperatively. The most common pattern is to use top-level function calls as entry points into the service. However, the service cannot dictate the order in which to receive events, and must be multi-thread safe to coordinate multiple client threads. The trouble is compounded117 in patterns such as subject-observer which involve reentrant callbacks or conversations with the client. This is fertile ground for deadlocks, as two threads can obtain locks in reverse orders, because locks are obtained on demand, decided by the flow of control. This problem is closely aligned to another that crops up often in practice. Monitors and Java’s synchronized construct protect all service calls with first-come first-served locks∗ . Since the lock queue is hidden as part of the monitor implementation, a service cannot know which clients are blocked, leave alone being able to handle clients out of order. Consider the case of a list of transactions waiting on an object’s monitor. One might like the transaction with the earliest deadline to proceed first, or the one that started earliest (assuming heuristically that it has performed more work and is more expensive to rollback); such heuristics cannot be implemented in a system that does not expose the queue. A more imperative or active way is for the service to explicitly state the events or messages it is willing to handle at any point in the code. Concurrent languages have constructs for message reception (receive in Erlang, accept in Ada), multiplexing (CSP’s alt, for example), and pattern matching; these permit the service to use its own stack for processing the message. By pulling the message into its own context, a service can decide the order of event processing. Note that in the database example above, real databases are able to reorder transactions precisely because they handle the message in a different operating system process or machine (hence a different thread by construction) and also because they use asynchronous messages, not RPC, between client-side drivers and the servers. The point is that function-call based interface pushes events to the service in an indeterminate order, while a messaging interface allows the service to pull messages in a relatively deterministic order. A related problem with the function-call approach is that for any higher-level protocol that requires multiple function calls, the service has to squirrel away the intermediate state somewhere, to pick up from where it left off in the last call. This problem is routinely encountered by web service developers. In a case where the service has its own thread and stack, the stack automatically tracks intermediate state. In chapter 5 we will examine many continuation-based frameworks (Kilim included), some developed exclusively for web interaction. The next problem with a functional API for the service side is that there can be at most one response to a given request, which cannot handle single-request/multiple-response patterns of streaming or result-batching; examples are, retrieving rows from a database, lines from a file and bytes from a ∗ Most monitor implementations are FIFO for fairness reasons, even if the specification does not require it CHAPTER 2. BACKGROUND 33 socket. In all these cases, the caller does not have to wait for the entire data to come in before processing can begin. Ideally, a server ought to be able to be push multiple responses, and correspondingly, the client ought to be able to pull the data at its own convenience. Traditionally (in C, Java), a callbackbased interface is used here, but that solution inverts the flow of control. 2.2 Hardware trends and server-side design opportunities Let us examine hardware and systems trends at both the micro and macro scales (within a processor, and within a data center respectively). These scales are not entirely unrelated, as cloud and utility computing services are serving an increasingly large number of personal and corporate computing and storage needs. Chip-level processing. Computing speed has depended on our ability to pack more and more transistors on a single die, and on increasing clock frequencies. The heat generated by these two architectural features, and the resulting power and cooling requirements have garnered as much attention from data center administrators as the increases in speed. At the time of this writing, it is estimated that the cost of the energy expended by a server over its lifetime (four years) exceeds the cost of the server itself. The scale of operations can be inferred from the fact that one of the largest data centers at this time is a 192 MW facility. Benchmark metrics such as Sun Microsystems’ SWaP (Space, Watts and Performance) seek to normalize standard performance metrics (such as TPC) by rack height and power consumption. Another consequence of increasing processor speeds is the exponentially widening disparity between CPU and memory speeds. A simple cache miss to memory can cost about 200-300 clock cycles and dominates performance metrics. Many levels of cache hierarchy and complicated instruction level parallelism techniques have been introduced to fill this gap, but it is untenable to stretch it further. These problems have forced a move towards chip-level multi-processing (CMP, also known as multicore processors), where more processors are packed into a single die, and where each processor is simpler and operates at lower frequencies. This has the desirable effects of making the system more compact (by sharing on-chip resources such as L2 cache and memory controller), reducing the mismatch between processor and memory speeds, and running cooler. Since power consumption has a cubic relationship with frequency, one can either consume far less power or enable more cores within a fixed power budget. Many-core chips are already a reality for niche applications such as graphics, multimedia and signal processing. Let us presume for a moment that chips with hundreds of cores will be commoditized29 in general-purpose computing as well. In such a context, applications written with a single-threaded, event-oriented mindset will necessarily underperform. There are many more such applications than meet the eye. For example, the up- and down-calls in network protocol stacks,78 and push-pull strategies of routers such as Click106 are products of a single-processor mindset, and will need to be reengineered. Protocol layers will need to be independently schedulable to take advantage of a number of cores. Chip-level multi-threading (CMT) is a still finer grain of division, where a single core’s execution resources are mapped to several (but fixed) number of threads. When a thread’s execution stalls due to an off-chip request, the processor moves on to another thread. Even as memory latency is the 34 2.3. COMMUNICATING ACTORS single biggest factor in a server’s performance, the memory bandwidth is increasing163 at a rapid clip (60GB/s/chip). CMT allows many more threads to stand by waiting for memory. The point is that hardware parallelism is getting cheaper. Clearly, single-threaded servers do not map well to this trend. As it turns out, present day multi-threaded applications are not always transparently portable to the many-core era either, as many of them rely implicitly on low processor counts. For example, using a shared mutable B-tree subjects the root to exponentially rising latch contention (with the number of processors). The solution might be to partition data (sharding) or replicate data amongst processors. In other words, we should treat a single multi-core or multi-processor box as the distributed computing environment that it is,23 and pay attention to locality and interconnect traffic; such non-shared, messaging solutions resemble the ones adopted at the level of the data center. Section 2.4.3 discusses a few notable projects that are taking this route. Fast networking. Changes in processor design apart, the drop in memory prices coupled with Gigabit networking represent an inflection point in the way online storage may be structured in the future. The RAMClouds project138 proposes to aggregate the memories of thousands of servers to provide a 1000x increase in throughput coupled with very low latency (5-10µs). The new slogan is “RAM is the new disk, disk is the new tape”. Mixed language. Finally, there is increased usage of multiple languages on the server-side, often presented as a mixture of “systems programming languages” and “scripting” languages (e.g. Google’s AppEngine, YouTube). These environments too present isolated subsystems that communicate using well-defined messages. We will argue later for an asynchronous messaging scheme. 2.3 Communicating Actors Our thesis is that a system of actors with lightweight, memory-isolated threads and communicating with explicit messages is well equipped to handle the problems of present-day systems and the opportunities of the future. As Edward Lee says, “things happen to objects, actors make things happen”.117 Advantages Threads and blocking calls permit code to be written in a linear, imperative style. When threads are light enough to comfortably match or exceed the concurrency requirements of the problem, blockinga non-issue. Memory isolation avoids the problems and mistakes such as low-level data races and the subtleties of memory models associated with the current shared-memory paradigm. We advocate an actor model that is not internally concurrent (atomic, in the taxonomy of Briot35 et al. , so that all messages sent to it are handled in sequence. This architecture is equivalent to obtaining a monitor-lock per actor. The only way an external entity can affect an actor’s state is by sending it a message, and the actor can check its invariants at message sending and reception points. Messaging gives equal syntactic treatment to both requests and responses. By dissociating the number of responses for a given request, the actor can implement push-style responses. In fact, an actor can even autonomously send out a message; it does not always need to be reactive. Messages can be acted upon in a batch, which improves spatial and temporal cache locality. Messaging avoids lost notifications because it unifies the control and data planes: a message contains the changed data as well as the signal. Note that messaging by itself does not prevent deadlocks, but CHAPTER 2. BACKGROUND 35 buffered messaging and the coarse-grained concurrency nature of actor systems in practice make the problem considerably more tractable. Channel-based messaging combines the ownership information of locks and the count of notifications inherent in semaphores. Messaging pipelines permit easily visualizable dataflow models. These are particularly useful cases where the connection topology is static and the rates of message arrival and the cost of message processing are known in advance. For such networks, a scheduling plan can be created ahead of time.143 The actor architecture permits debugging, testing and performance profiling in isolation. Just as one debugs hardware by putting probes on the wire, message sending and receiving primitives can be instrumented to monitor the data that enters and exits an actor. Static isolation guarantees can help garbage collectors and compilers to aggressively optimize use of the current hardware infrastructure, more than is possible today. Since all objects are accessed by a single actor at a time, and since in our approach, all objects reachable from a given object are also guaranteed to be visible by the same actor (and only that actor) at a given point in time, the garbage collector can perform heap traversals on a per-actor basis without fear of interference. A compiler (and the programmer) need not account for the possibility that reordering memory accesses and hoisting computations out of a loop may be visible outside that actor. Issues This section examines a few problems seen with actor-oriented systems in practice. The term actor as defined originally by Hewitt87 refers to an active object, whose identity is used for sending messages. In practice, a single mailbox used to receive messages could suffer performance problems, because messages retained in the mailbox tend to get pattern matched over and over until they are drained, an issue in languages such as Erlang. In addition, since an object can only have a single type (or a single instantiated type in the case of parametrized types), actor systems tend to be untyped to accommodate unrelated message types in the same mailbox. Without low-level hardware support, pipelines and active objects (actor) are much more expensive than virtual function calls on passive objects, which necessarily puts a back pressure on the programmer to reduce the frequency of messaging. Although process calculi and actor systems can model shared memory approaches, in practice sharing passive objects is much more efficient at a fine-grained level. While a judicious use of message batching irons out performance differences, the problem with the actor model is that the coarseness of an actor (the grain of schedulability) is fixed at compile-time; as the actor becomes finer, the fraction of switching overhead increases and the opportunity to run a computation to completion (until invariants are re-established) is reduced. In a language with pointers and mutable state, pointer aliasing is a burdensome issue. An actor can embed a pointer to some data structure in a message, and then subsequently modify that data structure. It is difficult to track such breaches of isolation. We will devote considerable attention to this issue in the next few chapters. One can avoid aliasing by making messages immutable (§2.1.2), or by copying data items, but these can both lead to considerable extra work for the garbage collector. Immutability must be a deep property in order to be usable, but given that there is no clear-cut difference between a resource such as a socket and an ordinary data item, one has to ensure that all objects in a message are immutable or pure (have no side-effects). 36 2.4 2.4. RELATED WORK Related Work Concurrent systems16 have been the richest veins of research since the early days of computing; entire books can not cover all the research work. In this section, we discuss principles and languages that have been used to implement systems that have seen a measure of end-user usage, or those (such as software transactional memory) that are gaining currency at the time of this writing. Briot35 et al. supplies a useful taxonomy for object-oriented concurrent and distributed systems. Peter Van Roy153 presents informal operational semantics of all concurrent models in a high-level language, Oz; in particular, the treatment of declarative concurrency and logic variables is recommended. While the determinism of declarative concurrency is valuable for reasoning and correctness, we have not yet pursued that avenue because our focus has been on those server-side systems that tend to be non-deterministic due to changing traffic patterns and failures. That said, the approach of wiring together components in a static dataflow configuration is intuitive and appealing; tiny sensor systems have been written in the communicating state machine model using the nesc programming language69 (including the TinyOS68 operating system), and the BBC’s Kamaelia102 framework is a python library with a vast number of media components. 2.4.1 Control Plane Most concurrent languages provide a similar slew of facilities for spawning threads of control, and synchronizing transfer of control between them. We discuss thread creation and scheduling, and service invocation approaches on the client and service end of things following the discussion of §2.1.3. One basic primitive common to all concurrent languages is a lightweight thread (the ability to create tens of thousands of them would classify it as lightweight for our purposes). The Kent Retargetable occam-pi project,134 which combines CSP and the mobility features of the π-calculus, is the standard bearer for lightness and speed; it demonstrates switching and process-creation speeds in the range of tens of nanoseconds on 800Mhz processors. Most concurrency languages, Kilim included, separate forking threads from joining. Some, like Mesa111 and Erlang13 take a functional approach where any function can be used as a starting point for a thread, and some, like Ada and Kilim take an object-oriented view (with defined entry points). Some such as occam and Cilk provide systemic and syntactic support for fork-join parallelism (such as a par construct), where the lifetime of a child thread never exceeds its parent’s lifetime. This pattern is useful for parallel programming, but less useful for the workflow or staging style often seen in server systems. Next, take service invocation. On the client end of things, languages (such as Mesa, ABCL and SR) allow the client to invoke a service synchronously, asynchronously or in a deferred synchronous fashion. John Reppy’s Concurrent ML generalizes the deferred synchronous approach; instead of calling receive on a channel (which waits until a channel has data), one can call recvEvt on a channel. This returns a form of a future object called an event; the program can wait on this object using sync. Multiple event objects can be waited upon using a non-deterministic primitive, choose. The power of the system comes from the ability to compose combinators of event objects, and to sync on them. Kilim’s mailboxes can be invoked in both synchronous or asynchronous fashion (§5.2), and additionally sup- CHAPTER 2. BACKGROUND 37 port callbacks for such higher-order event-composition. The limitations of the Java language’s syntax render the Kilim approach considerably less elegant than using Concurrent ML. On the service end of things, the ability to selectively receive messages is important to imperatively drive the flow of control. Kilim’s facility for blocking on multiple mailboxes permits filtration on both the content and the source of events. Erlang and Ada provide syntactic support for selective message reception and guarded expressions (that need to be true in order to receive an event), but provide no control over source-level filtration.∗ It is often necessary for an application to filter or reorder events based on the event’s source. For example, a database may give a higher priority to a request from an older transaction. In most monitorbased approaches (including Mesa, Ada and Java) where event-delivery is via method calls, and in actorbased approaches with a single mailbox (Erlang), the event source is not provided by the framework. The ability to receive messages on multiple ports relieves the programmer from having to separate events by source. Such a pattern is used widely in practice; for example, the Internal Revenue Service in the US uses different P.O. box numbers for different criteria, thereby getting the tax payer (the source of the document) and mail service to implicitly sort the tax returns for them. Finally, a word about scheduling control. Typically, concurrent languages and environments do not permit the application programmer to dictate thread placement, processor affinity and order of scheduling. We believe that spatial and temporal cache localities are important considerations for server-side programmers intent on squeezing performance, especially in many-core configurations where inter-core communication bandwidth may be a limiting factor. We will discuss this further in chapter 5. 2.4.2 Data Plane The Concurrent Pascal83 and Mesa languages were seminal influences on most concurrent languages, in protecting shared objects with monitors. As it turns out, we are still prone to the same issues that Lampson and Redell111 cataloged in an Mesa experience report in 1980. As Per Brinch Hansen84 argues, many of these problems can be traced to the fact that mainstream languages and environments have weakened the original definition of monitors, by permitting internal state to be visible publicly, and without an automatic lock. Cantrill and Bonwick,39 on the other hand, presents a strong endorsement of shared memory and fine-grained locking; our only caveat is that most programmers do not have the expertise of the authors. In locking approaches, the thread holding the lock can perform actions in a critical section, but if it is context-switched out, no other thread waiting on that lock makes progress. A non-blocking algorithm65 does not employ locks or critical sections. The key difference is that unlike the locking approach where a thread only does its own thing, in a non-blocking approach, a thread performs some other thread’s work if it is unable to do its own, as long as the overall system moves forward. Here is an analogy. Consider a laundry room with a washer and a drier. Assume that the washer has finished washing its load, but the person who started that cycle has gone out for a long walk. Your options are: (a) wait (block) until that person returns and moves the clothes to the drier or, (b) you transfer his clothes yourself before loading your own. The latter is non-blocking. It is particularly useful ∗ This is not a huge problem; the source task or process can embed its id in a message. 38 2.4. RELATED WORK in preventing priority inversion (where a lower-level thread owns a lock and gets swapped out by a higher-level thread that happens to want that lock). Non-blocking algorithms are decidedly non-trivial; there are a handful of experts in the world able to produce portable and correct lock-free code. Instead of general algorithmic patterns, the focus now is on packaging this expertise72 into lock-free container classes (lists, queues and so on). The problems described in §2.1.1 remain, however; concurrency guarantees are only for the supplied container classes, not for the elements they contain. That said, we expect to use these for Kilim’s internal data structures, such as mailboxes and schedulers (at the current stage of evolution, non-blocking structures in Java are more expensive). A far more general, and more tractable, approach to lock-free updates of shared memory is Software Transactional Memory (STM). As with database transactions, all reads and writes within an atomic block belong to a transaction, and tracked accordingly in a separate log. At commit time, the writes are atomically committed to shared memory if and only if the read and write sets have not been updated by another transaction. Otherwise, the transaction either aborts or retries automatically. Software transactions are composable85, 86 when implemented as language-supported lexical scopes and are deadlock-free regardless of the pattern of access. The flip-side is an indefinite number of aborts or retries — livelocks — in the presence of contention, as seen in optimistic database transactions. This is a non-trivial performance issue. Livelocks, unlike deadlocks, are hard to detect. As functions compose and more and more functionality gets pulled within the scope of a transaction, the window of contention increases dramatically. The other problem related to STM performance is that the more fine-grained the data to be mutated, the more the transactional state that needs to be maintained and compared at commit time. STM is meant for providing serializability, and does not cater to looser consistency; for example, if an unrelated value (such as time) were to be read while processing a transaction, an update to that value would roll the transaction back. But worse is omitting to include the read of a relevant value inside a transaction; this has the same problem as lock omission. Cascaval42 et al. express considerable skepticism about STM’s viability as a replacement for fine-grained locking. The other counter to STM is that traditional locks are not heavyweight any more14, 189 (5-9 instructions for acquiring and releasing locks), and that whole program analyses exist28 to remove unnecessary synchronization. STM is however perfect for a language built upon principled (and rare) mutability. Languages such as Erlang, Clojure and Haskell are built on a foundation of immutable values and use functional data structures and multiple versions of data instead of in-place updates. Haskell provides the best approach to STM, wherein the type system statically verifies that within the scope of a transaction, only specially identified transactional types can be modified; no others can be modified and no I/O can be performed. The language can automatically retry the transaction because the old state and inputs are immutable and not consumed. For most languages, mutability and pointers are the norm, and almost without exception, pointers embedded in messages allow contents to be simultaneously visible and mutable by both the sender and receiver. We seek guaranteed isolation. The E and Limbo languages embrace in-place updates and pointers, yet guarantee isolation. Actors do not share memory and limit pointer aliasing, mutability and reachability to the confines of an actor. In order to communicate with others, an actor either exchanges immutable values or clones (deep-copies) object graphs. The cost of cloning is a concern in a low-level systems framework (protocol layers, for CHAPTER 2. BACKGROUND 39 example). Erlang too copies values between actors’ heaps for ease of garbage collection,99 but an analysis directed storage allocation scheme exists41 to statically discover which allocations are purely local and which may be exported. Mesa’s hardware-supported copy-on-write approach is another possibility; however, such an approach needs to distinguish between cloneable data values and (uncloneable) references to finite accountable resources (e.g. database connection). Kilim is similar to E and Limbo, but the mechanism used to guarantee isolation is a significant point of difference. Our approach is to avoid cloning by sending messages by reference, but as we mentioned in Chapter 1, the type system enforces linear ownership of a message. This permits the receiver to assume ownership of the object and to mutate it in-place without the worry of the sender or any other actor being able to observe that object. The programmer has to explicitly clone a message if he needs to retain a copy; this renders the cost of cloning explicit in the code. 2.4.3 Isolation: not an isolated quest We briefly describe a few recent research projects that serve as reference markers for the Kilim project. All these approaches have one architectural aspect in common: memory-isolated communicating actors. The Multitasking Virtual Machine53 (MVM) provides multiple logical JVMs (isolates) within a single process. Communication is via serialized messages. The Incommunicado project139 provides a lightweight mechanism for helping isolates communicate, by deep copying objects instead of serialization, because the machine and the class files used by the isolates are the same. In “The End of an Architectural Era (It’s Time for a Complete Rewrite)”, Stonebraker169 et al. propose rewriting databases as a collection of interacting database actors, called sites. Each site runs on a single core, and is a single-thread state machine that implements a query interpreter and has its own copy (or partition) of data and accompanying indices. Fault-tolerant storage is achieved not by archiving on disk, but by replicating copies in RAM among different sites. The Barrelfish22 project is a multi-kernel operating system based on similar principles. A kernel runs on a single core, and communicates using explicit asynchronous protocols with kernels on other cores. State is replicated, not shared. It is built on the premise that operating systems currently make optimizations and assumptions — related to the consistency model, the cache hierarchy, and the costs of local and remote cache access — that are unscalable and sometimes not even applicable to newer processors of the same family. They demonstrate the limitations imposed by the interconnect bandwidth, and show that message passing can be much more efficient than shared memory. At its current stage of evolution, it is unclear whether the message passing interface is available to the application level, whether one can have multiple actors per core, and whether actors in a process can run on separate cores. The Erlang project was one of the original inspirations for Kilim. We wish to combine the extant experience and familiarity of the Algol family of languages with some of the systemic features of the Erlang environment (that is, we borrow ideas from the run-time, not the language), especially its failure recovery model.12 The essential idea is to avoid exception handling altogether. Instead, fail-fast semantics let an actor crash, which in turn might lead to the crash of other dependent actors. At the topmost level of this cascade are a hierarchy of supervisor objects which re-spawn actors from known consistent state. Another key lesson from Erlang is the fallibility of micro-benchmarks. A C programmer would 40 2.4. RELATED WORK not be faulted for assuming that an environment such as Erlang would be slow in a systems context, given the following features: interpreted code, isolated actors and data copied between heaps. Yet, end-user systems written in Erlang are more than acceptably fast in practice,57 and win on many other counts such as fault-isolation and simplicity. Latest versions of Erlang have just-in-time compilers and true SMP support. The project with an approach very similar to ours is the Singularity operating system, written in a language called Sing# . Singularity is a microkernel that relies on a type system (discussed in Chapter 4) for creating so called software isolated processes (SIPs). It supports pointers and garbage collection within a SIP, and separates message objects from internal objects. Messages are hierarchically structured, as in Kilim. Such features (along with security policies) permit device drivers, the kernel and user-code to be all part of the same address space and hardware protection ring. Messages are allocated from (and live in) a special exchange heap . Communication and synchronization is performed through channels, which can specify a (bidirectional) contract in the form of a finite state machine specification. This approach is more expressive and powerful than guard expressions in Erlang and ADA; it separates the interface automaton from internal code. We find this approach much more intuitive compared to session types.70 The current version of Kilim is but a starting point on the path that Singularity has blazed. The essential difference is that ours is an evolutionary approach, which allows a programmer to mix and match the vast set of extant Java libraries along with the messaging mindset. In the rest of this dissertation, we will show that it takes just a few modifications to retrofit the communicating actor approach into a mainstream language, and that it is possible to improve speed while guaranteeing safety and preserving the simplicity of sequential code. Kilim Abstract Machine 3 This chapter presents an imperative and concurrent abstract machine called the Kilim Abstract Machine (KAM) to discuss the ideas presented in the next chapter. The KAM is an intermediate language similar in spirit to portable virtual machines such as the JVM∗ , the .NET CLR, Parrot and LLVM. The reason for an abstraction such as the KAM is to highlight a different approach to two aspects: memory model and concurrency-awareness. The JVM119 and others feature a shared memory architecture where threads are part of a library, and the language and execution engine are not aware of scheduling and interleaving semantics. All objects are potentially sharable; each object in the JVM is provided with a private lock that tracks (a) which thread currently owns it and, (b) the threads waiting to obtain the lock. The JVM provides modifiers such as volatile and synchronized, and primitives such as wait and notify to control thread execution and to specify consistency points where data updated by one thread is visible to other threads. Importantly, a method or class definition does not provide any hints about whether an object is concurrently accessible by multiple threads, which renders concurrent reasoning hard and forces compilers to be conservative in their (sequential) optimizations. The KAM, in contrast, features a communicating sequential actors model where each actor is a thread with a private heap. It distinguishes, via a type system, the objects that are never going to leave the confines of an actor (structs), from objects that may be exchanged between actors, messages. A third category of objects, mailboxes, provides a globally accessible id-space and synchronization primitives. The type system constrains the shape and pointer aliasing of message objects. This chapter describes the operational semantics of the machine and the next chapter presents the type system. Our choice of primitives and semantics is primarily driven by a pragmatic view; we wish to remain within the idiomatic ambit of environments such as the JVM, .NET and C, and we eschew features that may not translate to efficient code in these environments (tail call optimization is not always available, for example). At the same time, we chose isolation as a key feature of the underlying model to firstly simplify the mapping (and hence the proof of soundness) between the static and dynamic semantics and secondly, to be able to track that piece of memory in a garbage collector (similar to Singularity’s Exchange Heap). 3.1 Syntax Fig. 3.1 shows the syntax for the Kilim Abstract Machine visually separated between features commonly found in typical virtual machines such as the JVM, and Kilim-specific primitives for tasking and communications. A Kilim program consists of a list of named structures (struct for internal objects and msg for exportable messages). Mailboxes are typed; a mailbox typed as M mbx may only be ∗ We will henceforth use the JVM as a representative model for the others. 42 3.1. SYNTAX x, y, p ∈ variable names m ∈ method names b ∈ variable names (for mailboxes) S ∈ struct type names M ∈ message type names f ∈ field names lb ∈ label names n∈Z Program ::= Struct Struct ::= struct Msg ::= Func ::= τ m(q τ p) { lbopt : Stmt } Stmt ::= x = Rhs | x. f = y | x. f = msg ::= τ ∈ Type q ∈ Qual S {τ f} m(x) | null | n new Msg M{ τf } spawn Rhs Func put null | ifeq x y lb | return x | b x | x | x. f | new S | m(y) | M | newmbx M | get b | select b ::= int | S ::= whole | M | M | part | mbx safe | none Figure 3.1: KAM syntax. Shaded parts indicate Kilim support for communicating actors. used to transfer message type M . Struct names, message names, mailbox types together with int form the set of base types. A function contains a list of statements, which are simple three-address constructs. In addition to the usual statements for transfer of control (if, return and function calls) and assignment, one can spawn a new actor (spawn), create a new sharable mailbox (newmbx), put and retrieve messages from mailboxes (put and get), and use select to select a non-empty mailbox from a set of mailboxes (similar to CSP’s alt construct but with no guard expressions). The new statement creates a structure or message object in the heap and returns an object reference. Local variables contain integer values, object references and a distinguished null value. Variables are treated like untyped registers and spring into existence on assignment. Global variables and static class members are not present. We shall distinguish between the terms message object and a message. The former refers to a single object, whereas the latter refers to the entire cluster of objects transitively reachable from a given (root) object. That is, when we refer to a program variable x as a message, we implicitly include all the objects reachable from that variable (via its various fields), and transitively so. To export or send a message is to use it as an argument to the spawn and put primitives, which remove the message (the entire cluster of linked message objects) from one heap and transfer it to another actor or a mailbox. A message not reachable from any other object in an actor’s heap is termed a whole message, otherwise it is a part message. The type system ensures that only whole messages are sent. A function declaration contains a list of typed parameter declarations and a typed result. Each parameter has a name (unique to the function signature), a base type and a type qualifier. Type qualifiers are required for message types to dictate ownership and consumption policies; the next chapter delves into the details of qualifiers and their effects. Struct types must always be qualified with none (recall CHAPTER 3. KILIM ABSTRACT MACHINE 43 that is an abstract syntax, not a higher-level language syntax exposed to the programmer; it is simpler to be explicit for the sake of uniformity). Each function parameter is a local variable and a function signature is the only means to associate a variable with a type; a type inference mechanism infers the rest (§4.2). Well-formedness rules. A program is well-formed according to the following rules. Structure names and message type names are unique within a program, and while mutual and cyclic type references are permitted, message type names cannot be used to type fields in a struct declaration and vice-versa; this is the first step towards separating between internal and exportable objects. Mailboxes are the only entities sharable between actors, and both object and message declarations may freely refer to the same mailboxes Field names are unique within a structure and parameter names are unique within a function signature. Function names are globally unique (no overloading) and a distinguished function called main (no arguments) is required to serve as the program’s entry point. Statements in a function may optionally have labels (unique within the function) and the target label in an if statement must exist within the function. The function supplied to spawn may only have mailbox and integer-typed parameters. Functions returning a message type can only return whole messages (the return qualifier in the method signature must be whole), and put and get are built-in primitives that take and return whole messages only. We assume automatic garbage collection but omit features such as subtyping, virtual method dispatch, access privileges, exceptions and inline structs; these are largely orthogonal to our expository goals. §4.6 presents further details on how these issues are mapped to the Java setting. 3.2 Operational Semantics The dynamic semantics of KAM is defined as a small-step transition relation on a world configuration. A world W consists of a set of actors, and a shared post office P which keeps track of a set of sharable mailboxes. The grammar for the runtime configuration is in Fig. 3.2. (Please note that the notation and nomenclature is summarized on page 115) Each actor has a private heap that maps object ids to values, object ids and mailbox ids among them. Mailbox ids are globally unique names that can be shared by embedding them in messages. This is equivalent to mobility of channel names in the π-calculus model.124 The operational semantics is shown in two parts: actor-local transitions (Fig.3.4) where the world evolves by transforming a single actor A; the rest of the world (the other actors As and the post-office P ) are unchanged and unaware of the changes to A. Fig. 3.5 shows transitions that have a global effect (the change may be visible to another actor); this includes modifications to the set of actors, the set of mailboxes in the post-office, or changes to a particular mailbox. The initial world configuration (A, ;) is a single actor and an empty post office, where the actor A is initialized with an empty heap and an activation frame loaded with the body of the distinguished entry function main: A = ((body(main), ;), ;). An actor ceases to exist after popping the last activation frame off the callstack, when it executes return. Note that the semantics for return are present in both figures 3.4 and 3.5 to account for the non-empty and empty callstacks respectively. The first one merely changes the local actor, the second changes the list of actors. 44 3.2. OPERATIONAL SEMANTICS W ∈ World = Actor × PostOffice Global configuration: bag of actors and a central post-office. One actor is non-deterministically chosen for reduction. A ∈ Actor = Callstack × Heap Actor configuration: call stack with a private heap S ∈ Callstack = Frame Call stack: list of activation frames F ∈ Frame = Stmt × Locals Activation frame: A list of instructions (Fig. 3.1) and a local variable store. The sole candidate for reduction (for this actor) is the head of the instruction list (Stmt) of the frame at the top of the actor’s call stack. σ ∈ Locals = Var 7→ Val Local variable store (or register file): maps variable names to values. Val = Id ∪ n ∪ {null} Values. Id is a domain of unforgeable object identities (distinct from integers) that refers to both objects and mailboxes. H ∈ Heap = Id 7→ Object Partial map of ids to objects. Actors and mailboxes each have a heap of their own. Object = FieldName 7→ Val Objects are not nested; fields contain references to objects or mailboxes or are integer-valued or null. P ∈ PostOffice = Id 7→ Mbx A collection of mailboxes indexed by mailbox id B ∈ Mbx = Id × Heap Mailbox. A collection of distinguished root object references, pointing to objects in the mailbox’s own heap Figure 3.2: KAM runtime configuration A mailbox is a passive repository of messages. It stores whole messages in a private heap and keeps them separate using an additional list of root references. There is no lookup mechanism for mailboxes; an actor can use a mailbox only if it has obtained its id from another actor or created one itself. The fact that object and mailbox identities are unforgeable permits one to create private channels of communication. An actor can never directly affect the state of another actor (except for spawning one). Clearly, this means that for actors to communicate, they must use one or more mailboxes. Put, Get. The put statement makes use of a split operation to partition the actor’s heap into two disjoint parts (as in separation logic), one containing those objects reachable from the supplied object, and the rest. The former is separated from the actor’s heap and merged into the mailbox. Note that this may leave dangling references from the remaining heap, and the actor’s local variables, into the heap partition that is now merged into the mailbox. In addition to transferring objects from an actor’s heap to a mailbox’s heap, put adds the root reference to a distinguished set of root references. A subsequent get does the reverse: it selects one of these roots from the given mailbox, and transfers all reachable objects into the actor’s heap. Select. The select statement takes a list of mailbox ids and returns the index of a mailbox ready for interaction. As implemented currently, select is only used to signal that a mailbox is get-ready (is non-empty), but does not signal a put-ready mailbox (signifying that the mailbox has one or more slots available). While easily rectified, this issue is orthogonal to the main thrust of our work. CHAPTER 3. KILIM ABSTRACT MACHINE 45 fields(T ) = 〈τ f 〉, given a structure (correspondingly message) declaration of the form struct T {τ f } args(m) = p, the list of parameter names, given a method declaration of the form τm(q τ p){lbopt : stmt} body(m) = stmt, given a method declaration of the form τm(q τ p){lbopt : stmt} = block(m, lb) (sk , · · · sn ), the list of statements beginning at label lb in function m, which is of the form τ m(q τ p){lb0 : s0 , · · · , lbk : sk , · · · , lb n : sn } and lb = lb k for some k. = split (ρ, H) (Hrem , Hρ ) where ρ ∈ Id and H , Hrem , Hρ ∈ Heap Splits heap H into Hρ , the cluster of ρ, and the remaining heap Hrem Hρ = H (ρ0 ), for all ρ0 ∈ dom(H ) and ρ0 ∈ cluster(ρ, H ) Hrem = H (ρ0 ), for all ρ0 ∈ dom(H ) and ρ0 ∈ / cluster(ρ, H ) cluster(ρ, H ) = merge(H1 , H2 ) = Smallest set X such that ρ ∈ X ∧ (ρ0 ∈ X ⇒ ρ  H (ρ)( f ) if ρ ∈ dom(H1 )  1 λρ. λ f . H2 (ρ)( f ) if ρ ∈ dom(H2 )  undef otherwise H ρ0 ). (see defn. 3.1) Figure 3.3: Auxiliary function definitions used in Figs. 3.4 and 3.5 Remark. We assume blocking semantics for get and select and non-blocking semantics for put; the reason for not explicitly modeling an actor’s scheduling state is that it has no bearing on actor isolation. That is, it wouldn’t make an actor any more or less memory-isolated if get were to return null instead of blocking until the mailbox is non-empty. The configuration is stuck when any actor is stuck, and halts when the list of actors is empty. An actor is stuck if it accesses a null or dangling reference, but importantly, is not considered stuck when blocked on a get or select. Definition 3.1 (Reachability). Given a heap H , we define n-step reachability as follows ( f ranges over field names): Let ρ 1 ρ0 if ρ, ρ0 ∈ dom(H ) ∧ ∃ f .H (ρ)( f ) = ρ0 Let ρ n ρ0 if ρ, ρ0 ∈ dom(H ) ∧ ∃ f .H (ρ)( f ) n−1 ρ0 We use the following notation to indicate a path between two objects in the heap. ρ H ρ0 ¬ ∃n.ρ n ρ0 . 46 3.2. OPERATIONAL SEMANTICS (( x = null :: s, σ):: S, H ) ((s, σ[x 7→ null]):: S, H ) (( x = n :: s, σ):: S, H ) ((s, σ[x 7→ n]):: S, H ) (( x = y :: s, σ):: S, H ) ((s, σ[x 7→ σ(y)]):: S, H ) (( x = y. f :: s, σ):: S, H ) ((s, σ[x 7→ H (σ(y))( f )]):: S, H ) (( x = new T :: s, σ):: S, H ) ((s, σ[x 7→ ρ]):: S, H ), where ρ is fresh (( x = m(y) :: s, σ):: S, H ) ((s m , σ 0 )::( x = m(y) :: s, σ):: S, H ) where s m = body(m) and, σ 0 [ pi 7→ σ(yi )] and pi ∈ args(m) (( x. f = y :: s, σ):: S, H ) ((s, σ):: S, H [σ(x)( f ) 7→ σ(y)]) (( x. f = null :: s, σ):: S, H ) ((s, σ):: S, H [(σ(x), f ) 7→ null]) (( ifeq x y lb :: s, σ):: S, H ) ((s 0 , σ):: S, H ) ( where s 0 = (( return x :: s, σ)::( x 0 = m(y) :: s 0 , σ 0 ):: S, H ) block(mc , lb) if σ(x) = σ(y) s otherwise ((s 0 , σ 0 [x 0 7→ σ(x)]):: S, H ) See also return in Fig. 3.5 Figure 3.4: KAM Operational Semantics (part 1). Actor-local transitions of the form (A • As, P ) (A0 • As, P ), where A = (( stmt :: s, σ) :: S, H ). S is the callstack, H is the actor-local heap (Fig. 3.3). The current function at the top of the call stack is denoted by mc . (Note: mc is not explicitly tracked in the rules, for simplicity of exposition) CHAPTER 3. KILIM ABSTRACT MACHINE ((( spawn m(y) :: s, σ):: S, H ) • As, P ) 47 (((s, σ):: S, H ) • ((s 0 :: σ 0 ) :: [ ], H 0 ) • As, P ) where s 0 = body(m), H 0 = ; σ 0 [ pi 7→ σ(yi )], pi ∈ args(m) ((( return x :: s, σ):: [ ], H ) • As, P ) (As, P ) See also return in Fig. 3.4 ((( b = newmbx T :: s, σ):: S, H ) • As, P ) (((s, σ[b 7→ ρ]):: S, H ) • As, P ) ρ is fresh. ((( put b x :: s, σ):: S, H ) • As, P ) (((s, σ):: S, H 0 ) • As, P 0 ), where ρ x = σ(x), (H 0 , H 00 ) = split(ρ x , H ) ρ b = σ(b ), P (ρ b ) = (roots b , H b ) P 0 = P [ρ b 7→ B 0 ] B 0 = (roots b ∪ {ρ x }, merge(H b , H 00 )) ((( x = get b :: s, σ):: S, H ) • As, P ) (((s, σ[x 7→ ρ]):: S, merge(H 00 , H )) • As, P 0 ) (H 0 , H 00 ) = split(ρ, H b ) for some ρ ∈ roots b and where ρ b = σ(b ), B = P (ρ b ) = (roots b , H b ) P 0 = P [ρ b 7→ (roots b \ {ρ}, H 0 )] ((( v = select b :: s, σ):: S, H ) • As, P ) (((s, σ[v 7→ ρ b ]):: S, H ) • As, P ) for some ρ b ∈ {σ(b0 ), . . . , σ(bn )}, and |roots(P (ρ b ))| > 0 Figure 3.5: KAM operational semantics (part 2). World transitions of the form (A • As, P ) (A0 • As0 , P 0 ) Type System 4 The Kilim Abstract Machine, modeled on the JVM, stores untyped values (object references and integers) in the local variables and the heap. This chapter describes a type system that introduces aliasing and ownership constraints to eliminate two broad classes of errors. One set of errors arises from using a value in the wrong context: using an object reference as an integer for example, or using an undeclared field name, or using an object of one type where another is expected, etc. Such simple type errors are easily prevented with a conventional Java-style type system that reasons about the types of values in the local variable store and in the heap. This is done in the first phase of type verification (§4.2). However, simple type verification is not enough. One of the key aspects of languages with references is aliasing, where there are multiple ways of obtaining the reference to a given object. Aliasing takes the form of variable aliasing when more than one variable refers to an object, and heap aliasing, when there are multiple field references into an object (multiple incoming edges, if you visualize the heap as an object graph). For all its efficiency benefits, aliasing is a constant thorn in the side of modular reasoning. Consider the following simple example: 1 class C {Socket sock} 2 3 4 5 6 void foo (C x, C y) { close(x.sock); process(y.sock); } This example closes a socket that belongs to x, and uses one that belongs to y. Clearly, it is a problem if x and y refer to the same object. But even if they were different objects, we have no information about whether they refer to the same underlying socket. The issue is that in pointer-based languages, one cannot tell the difference between refers and belongs. The second issue is that foo modifies objects reachable through its arguments, and these modifications persist after foo returns. Specification languages that describe a method’s side effects tend not to be used in practice, either because they are incomplete or verbose; further, the deeper the nesting of method invocations, the more the possible side effects. In addition, a function’s side effects may depend on the aliasing of its input arguments, which is why we cannot reason about a function on its own. The paper “The Geneva Convention on the treatment of object aliasing” by Hogg93 et al. is an excellent overview of the perils. They categorize all approaches to deal with the aliasing issue in terms of: • Alias Detection. These approaches perform post hoc alias detection. However, given that precise 50 4.1. DESIGN CHOICES alias detection is undecidable148 (they must necessarily be conservative), and that whole program analyses are NP-hard, we do not use them. • Alias Advertisement. Programmer-supplied annotations advertise whether (and how) they create new aliases to parameters. • Alias Prevention. Aliases can be implicitly and automatically prevented by many techniques, an important one being treating assignment as a deep-clone operator. We do not use these methods for efficiency reasons. As a side note, we prefer potentially expensive operations such as messaging and context switching to be explicitly spelled out, as a constant and visible reminder to the programmer.∗ • Alias Control. These schemes enforce explicit control over aliasing, in that it is a compile-time error if an undesirable alias is created. We take this route, as described in the next section. 4.1 Design Choices We now outline the key principles and design decisions for a second phase of the type verification process. • Ownership transfer. As we mentioned in Chapter 1, we retain the mutable mindset of Java and other current systems programming languages, but make transfer of ownership a central feature in both the KAM and the type system, to ensure that an object is exclusively owned by a single actor at any time. The KAM does not offer immutable objects because we do not (yet) distinguish between objects that represent resources vs. normal data values; just because an object is immutable (say a file handle) does not mean that it is not a resource with ownership constraints. The type system outlined in this chapter enables a compiler to ensure that there are no dangling references. • Alias control only for messages. Empirical evidence from industry suggests that although aliasing is a nuisance, it tends to be easily manageable in practice for purely sequential code, especially when garbage collection is provided. This is because sequential patterns of data access are well understood and bugs in such scenarios are usually detected and fixed easily. On the other hand, aliasing and concurrency are a toxic cocktail. Since messages are the only kind of objects to be exchanged among concurrent KAM actors, we enforce aliasing and ownership handoff constraints only on messages. Such constraints may be considered too onerous for objects in general. We discuss more elaborate methods of alias control in the section on related work (§4.7). • Tree Structure. Kilim does not allow a message object to be heap aliased; it may be reachable from only one other object at most, and that too via a single field reference. That is, the following code is unacceptable: x.f = y; x.g = y. Multiple local variable aliases are permitted, however, as they are easy to track with a straightforward intraprocedural analysis. As we saw in §1.3, tree-structured messages are the norm in practice. In our experience, only compilers tend to communicate DAGs between stages, while typical server software use treestructured events and messages by convention. For this reason, it is not a stretch to harden this ∗ This is the reason for the @pausable annotation, instead of the weaver discovering it behind the scenes. CHAPTER 4. TYPE SYSTEM 51 pattern into a requirement. Tree-structured messages simplify alias analysis tremendously, and are fast to serialize onto a network. We say an object is owned if there exists another object in the heap that refers to it via some field: owned(ρ, H ) ¬ ∃ρ0 . ∃ f . H (ρ0 )( f ) = ρ Definition 4.1 (Unique Ownership). An object is uniquely owned if there is only a single “incoming arrow”. unique(ρ, H ) ¬ (H (ρ0 )( f ) = ρ) ∧ (H (ρ1 )(g ) = ρ) =⇒ ρ0 = ρ1 ∧ f = g for all ρ0 , ρ1 , f and g This definition permits loops (including self-loops) and does not care whether or not an object is garbage collectable. However, the type rule for field assignment (S-PUTFIELD on p. 65) prevents cycle creation. • Deep safety. The type system adds the notion of a safe message to the KAM. A message marked safe can never be exported, updated, or grafted (assigned to another object’s field). Safety is deep: if an object is safe, all members of its cluster are guaranteed safe as well. Contrast this to keywords such as Java’s final or C++’s const, where marking a reference-valued variable or field as final does not prevent the object it points to from being mutated, rendering the concept fairly useless in practice. We will later use a predicate safe : Id 7→ {true, false}, which is initialized from a type qualifier, discussed next. • Local, modular reasoning. We use annotations or type qualifiers on method parameters as the sole mechanism to convey extra semantic information and programmer intent. This permits each function to evolve independently (as long as it holds up its end of the contract) and to be analyzable independently and quickly. We use a type system to infer and reason intra-procedurally about types akin to the JVM verifier. In addition, the type system tracks the state (typestate) of each message object, described next. • Type qualifiers for consumption semantics. Our annotations ({whole, part, safe}) are used to specify the initial typestate of the supplied argument, and secondly, serve as an ownership or consumption contract between caller and callee. – The safe annotation is the mechanism to render the corresponding formal argument and its cluster safe (as defined above) for the duration of the function’s activation, including nested calls. The callee guarantees that objects in the cluster remain unmodified and unexported. Objects marked safe can be passed as arguments to another method provided the corresponding parameters are annotated safe. 1 2 3 4 5 void foo (@safe print (x); put (b, x); y = x.f; put (b, y); M X) { // for some message class M // ok: print is safe. // error: put defined as void put(Mbx, @whole T) // y is transitively safe, .. // ... hence error z = new M; z.f = x; x.f = z; } // error: safe object cannot be assigned .. // .. nor assigned to 6 7 8 9 10 52 4.1. DESIGN CHOICES – Recall the definition of whole from p. 42; it refers to an object that is not reachable from any other object in the heap. The whole parameter annotation states two things: that it requires a whole object to be supplied, and second, it is the function’s intention to assume ownership of the whole message (the object and its cluster) and to do with it as it pleases, including exporting to another actor, or grafting it on to another message object (assigning to another object’s field), or letting the garbage collector collect it. We will refer to all these actions as consuming a message. After the call is over, the caller must not use any variable references to any part of the cluster; it is expected to sever its ties to the object completely. To preserve the tree structure, only whole messages can be on the right hand side of an assignment statement (this restriction is only for message types). Of course, once assigned, they cease to be whole. 1 2 3 4 5 6 7 8 9 10 11 void foo (@whole M x, @whole M y) { // for some message class M print (x); // Ok. print is safe, and does not consume x. x.f = y; // Ok. // y is no longer whole (it is part of x’s cluster) z = y.g; // z reachable from x and y. put (b, y); // Error. y is not whole. put (b, x); // ok. // x’s cluster (including y and z) has been sent // x, y, z are not valid any more (consumed) print (z); // Error. } – The part annotation is similar in that the callee can consume all or some part of the cluster, except for the argument object itself; that is, the callee is not allowed to assume ownership of the argument object (although it can mutate it). The other objects in the cluster can be exported (as long as they are dissociated from their parent objects). This annotation is typically used for the receiver in an OO setting; calling a method doesn’t consume the object itself, but its cluster may look quite different after the call. Going with a rough fishing analogy, it is as if the argument object is the hook; whether it comes back with a fish or without the worm is not known in advance, but the hook itself is not lost. 1 void bar (@part M z) {..,] // for some message class M 2 3 4 5 6 void foo (@part y.f = x; // x.f = y; // // M x, @whole M y) { Error. x is not whole. Ok. y is no longer whole (it is reachable from x) 7 8 9 z = x.g; // x.g = null; // z is now whole (severed from x, assuming x was a tree to begin with) 10 11 12 13 14 15 16 17 bar(x); // // // print (y); // put (b, z); // z = x.g; // } Ok. x not consumed, but any variables dependent on x will not be trusted. Hence ... Error. y dependent on x (from line 5) Ok. z is whole. Ok. x has not been consumed. CHAPTER 4. TYPE SYSTEM 53 We have thus far spoken of a method’s parameters taken one at a time, in that a parameter annotation dictates the pre- and post-states of the corresponding argument. There is a second part of the contract between caller and callee, which applies to pairs of arguments: each nonsafe argument must refer to clusters that are disjoint from those of every other argument. The following example illustrates the problem if such a restriction were not enforced: 1 2 3 4 void foo (@whole x, @safe y) { send(x); print(y); } 5 6 7 8 9 // elsewhere ... b = a.f; // b belongs to a’s cluster foo(a, b); // error: non-disjoint arguments Here, foo seems perfectly in order by itself; it is allowed to send a message x that it owns, and to use the object y safely. However, if x and y were part of the same message, as happens in a caller-side aliasing relationship (line 9), the actor inadvertently uses the object after sending it (line 3). We thus enforce the rule that an argument supplied to a non-safe parameter must be disjoint from all other arguments to the called method. Such a disjointness constraint is common to all linear type systems; the idea is that since the analysis is intra-procedural, each function can be independently analyzed with the knowledge that its mutable parameters have no aliasing relationship at function entry. Kilim annotations are designed to provide isolation guarantees, and are not just passive pieces of information. The reduction in verbosity comes from choosing convenient defaults, and by putting constraints on messages (and messages only). This helps distinguish our approach from other annotation approaches, including more general ones such as ownership types. Only widespread usage can tell whether the Kilim design decisions are deemed acceptable or whether the level of verbosity is small enough. 4.1.1 Putting them all together The design points outlined above signal an important shift in strategy from the earlier chapter. Instead of speaking of isolation between actors at message transfer points, we now enforce isolation and message structure at all method calls (even within an actor). In the KAM, sending a message may leave dangling references behind. The type system of this chapter is far stricter, in that it forces us to forget about any potentially dangling references after every method call. This strategy allows us to reason about each method in isolation, which makes it robust (or less brittle) in the face of software evolution. It also allows the type system to treat a mailbox as a library, not as a built-in primitive, which permits us to add other concurrency mechanisms in the future: 1 2 3 4 class Mailbox { @whole T get() {..} void put(@whole T msg) {...} } 54 4.2. PHASE-1: BASE TYPE INFERENCE Recall that this strategy is only for message types; other types are not constrained any further. The downside of course is the necessity of annotating every message parameter. From experience, making safe the default annotation reduces the syntactic burden substantially. 4.2 Phase-1: Base Type Inference This phase distinguishes between base types: messages, mailboxes, structs and integers. This phase is similar to the JVM type inference and verification process, but is considerably simpler than the JVM due to the lack of the inheritance. Fig. 4.1 shows the type lattice. The type > refers to an unknown type; there is no top level type such as Java’s Object. The special type null is held only by a distinguished reference also called null. ⊤ M M mbx S int null Figure 4.1: Kilim’s type lattice for base types. > represents an unknown type. The phase-1 type rules are shown in Figure 4.2. The rules are of the form Γ ` stmt ⇒ Γ0 , where Γ is the type environment, defined as a total function: Γ : Var → Type ∪ {null, >}. The algorithm is standard to the JVM type verification process.118 We construct a control flow graph (CFG) based on implicit transfer of control between adjacent statements in a method and on explicit transfer of control instructions like ifeq and return. A forward dataflow analysis is performed on the join semi-lattice 〈Γ, v〉 whose ordering and join operators are defined component-wise in terms of the type lattice (lub is the least upper bound): Γ1 v Γ2 ¬ ∀v.(Γ1 (v) ≤ Γ2 (v)) Γ1 t Γ2 ¬ λv.lub(Γ1 (v), Γ2 (v)) where v ∈ dom(Γ1 ) ∪ dom(Γ2 ) The starting point for this analysis is the CFG node for method entry, seeded with an initial environment Γinit for the current method m: ( τ if v = pi , where sig(m) = qi τi → τ r and args(m) = p Γinit ¬ λv. i > otherwise Remark. We have described our approach in terms of intraprocedural data flow analysis and abstract interpretation,132 as opposed to modern type systemic approaches that are typically specified in a functional style.145 It allows us to test our ideas as additions to existing toolchains, with no changes CHAPTER 4. TYPE SYSTEM T-NULL y :τ∈Γ T-NEW Γ ` x = y ⇒ Γ[x 7→ τ] T-PUTFIELD x : T,y : τ ∈ Γ f : τ ∈ fields(T ) Γ ` x. f = y ⇒ Γ x : τ0 , y : τ1 ∈ Γ τ0 , τ1 ∈ { τ, null } for some τ Γ` T-PUT T-CONST Γ ` x = null ⇒ Γ[x 7→ null] T-ASSIGN T-IF 55 x :M ∈Γ Γ` ifeq x y lb ⇒ Γ b : M mbx ∈ Γ put b : M mbx ∈ Γ, for all b ∈ b. T-SELECT Γ0 = Γ[x 7→ T ] Γ ` x = new T ⇒ Γ0 x : T,y : τ ∈ Γ T-PUTNULL T-RETURN T-GET b x ⇒Γ Γ ` x = n ⇒ Γ[x 7→ int] Γ ` x. f = null ⇒ Γ sig(m) = ∗ → τ r T-INVOKE T-SPAWN y :T ∈Γ f : τ ∈ fields(T ) Γ` return x : τ r ∈ Γ0 x ⇒ Γ0 b : M mbx ∈ Γ Γ ` x = get b ⇒ Γ[x 7→ M ] T-NEWMBX Γ ` x = select b ⇒ Γ[x 7→ int] T-GETFIELD f : τ ∈ fields(T ) Γ0 = Γ[b 7→ M mbx] Γ ` b = newmbx M ⇒ Γ0 Γ0 = Γ[x 7→ τ] Γ ` x = y. f ⇒ Γ0 sig(m) = qi Ti → τ r τi = Γ(y) τi ≤ Ti Γ ` x = m(y) ⇒ Γ[x 7→ τ r ] sig(m) = qi Ti → τ r Γ` spawn τi = Γ(y) τi ≤ Ti m(y) ⇒ Γ Figure 4.2: Type Inference: Phase 1. (mc denotes the function currently being analyzed) required of the syntax or the existing type system. Second, annotations at the bytecode level allow us to experiment with other JVM languages such as Scala and Clojure. Finally, we also pursued this approach due to a parallel research interest in information flow and security,59 which too requires one to track aliasing and provide flow-sensitive inference. 4.2.1 Phase-1 Termination and Soundness: Proof sketch Each transfer function ¹stmtº of Fig. 4.2 is monotonic. Formally, Proposition 4.1 (Monotonocity of ¹stmtº). For all statements stmt and for all pairs of type environments Γ1 and Γ2 such that Γ1 v Γ2 , we have ¹stmtº(Γ1 ) v ¹stmtº(Γ2 ). Termination of the phase-1 flow analysis algorithm follows automatically5 from the monotonicity of the transfer functions together with the lattice 〈Γ, v〉. We claim soundness of this phase of type inference and verification by drawing parallels to the soundness of the JVM verifier type inference algorithm. Klein and Nipkow104 demonstrate a mechanized operational and static semantics for a subset of the JVM, which is adequate for mapping our static and dynamic semantics. The phase-1 analysis does not deal with subtypes, arrays, casts and local 56 4.3. TYPESTATES AND HEAP ABSTRACTION subroutines, some of which render the JVM type system unsound,118 in the absence of a run-time check. The Java equivalent of a KAM message type is a super class Message inherited by all other message classes. The instructions related to spawning threads and mailboxes can easily be translated to equivalent Java library methods (as they are in the Kilim implementation as well). 4.3 Typestates and Heap Abstraction The second phase of static verification enforces structural, aliasing and ownership handoff constraints on message objects. This section describes an abstraction called the shape state graph, the domain of analysis for the message verification rules described in §4.4. The next two subsections describe the two components of the shape state graph, typestates and the shape graph abstraction. 4.3.1 Object typestate We touched on two important aspects in the introduction. The first is adding safety via the safe annotation, the guarantee being that a safe object is protected from modification, from grafting, and from export. The second is to eliminate heap aliasing. Let us imagine that the KAM is instrumented with a run-time type system that checks the predicates safe(ρ) and owned(ρ) before any heap manipulation and function call. Recall that safe(ρ) = true for all objects ρ reachable from a safe parameter (all others are implicitly mutable); a safe object remains so during the lifetime of this function’s activation. However, it is possible to protect an unsafe object temporarily in a nested function call activation, if the argument qualifier is safe. For convenience of analysis, we combine the two orthogonal predicates safe and owned into a single state map ς : Id 7→ {r, w, f}, as shown below: safe(ρ) ¬safe(ρ) owned(ρ) ς(ρ) = r (readonly) ς(ρ) = w (writable) ¬owned(ρ) ς(ρ) = r (readonly) ς(ρ) = f (free) If an object is safe, it doesn’t matter whether it is owned or not. A writable object can be written to, and a free object is one that is both writable and unowned. To this we add two more states: null, to track nullified variables and >, to indicate an unknown or invalid state. These states can be totally ordered as null < f < w < r < > to reflect a subtyping relationship. The following table shows the preconditions enforced by the run-time type system (we use the notation ρv to refer to the object σ(v)). They formalize the statements made earlier in §4.1. (We show only the instructions that pertain to object references) CHAPTER 4. TYPE SYSTEM Instruction Precondition x = y.f ς(ρy ) ≤ r x.f = null ς(ρ x ) ≤ w x.f = y ς(ρ x ) ≤ w, 57 ς(ρy ) ≤ f m(.., xi , ...) ς(ρ x ) ≤ f, where sig(m) = ∗ → τ r and τ r is a message type  r if qi = safe  1. ς(ρxi ) ≤ w if qi = part , where sig(m) = (..., qi τi , ...) → τ r  f if qi = whole 2. (cluster(ρi ) ∩ cluster(ρ j ) = ;) ∨ qi = q j = safe put b , x ς(ρ x ) ≤ f return x The precondition for method invocation checks that the clusters of two arguments are disjoint, if either or both of them is non-safe. The precondition for x. f = y requires that the object on the right hand side is free (unowned, writable), and is subsequently not permitted to be used on the right hand side of a field assignment as long it remains owned. This restricts the object graphs to be constrained to self-loops, rings and trees; however, an object may not be part of more than one cycle. Such nonoverlapping cycles are benign in that there is nothing useful that can be done with an object in a cycle; it is unexportable and unassignable because it is not free. In any case, we will show later that the static rules presented in §4.4 rule out the possibility of cycles entirely. 4.3.2 Shape graphs This section describes a scheme to statically track how objects connect to each other, whether an object is owned or not, and which objects belong to its cluster. There are any number of heap and pointer analysis schemes, all necessarily conservative (because precise alias detection is undecidable148 ) and the tradeoff is between speed of analysis, amount of programmer-supplied information, and precision. One property that we are particularly interested in is strong nullification, when two objects become no longer connected. Most analyses are unable to guarantee this property. 1 2 3 4 x = y.f y.f = null put ... y print x // // // // y points to x via f y and x become no longer connected. consume y and anything reachable ok. At run time, an actor’s heap is a graph containing one or more independent messages at a given program point. An infinite number of runs of a program will create an infinite number of such concrete heaps (at the same program point). It is possible to statically approximate all such heaps with a single abstract heap graph containing a finite set of nodes and edges. Each node of this graph is an equivalence class representing an infinite number of run-time objects. If in any given run, two run-time objects are directly connected, we model the connectivity as an edge between the corresponding abstract nodes. Therefore, each edge of the graph represents the fact that there exists some run in which two runtime objects represented by the corresponding nodes are connected. There are many ways of partitioning nodes into equivalence classes. We adopt the heap abstraction from the shape analysis work of Sagiv, Reps and Wilhelm154 for our purposes, because it satisfies 58 4.3. TYPESTATES AND HEAP ABSTRACTION the strong nullification requirement outlined earlier. Our approach is less general, and consequently simpler, because we want to enforce a tree shape, not accommodate all shapes. The key idea in shape analysis is this: imagine that at a given program point, each run-time object is labelled by the set of variable names referring to it (the variable alias set). Since the number of program variables (|V |) is finitely determined by the source code, there is a finite number of these labels (2|V | ), or equivalence classes into which all objects can be slotted, at any given program point. The emphasis is important because the set of variables referring to an object changes with variable assignment; in other words, an object may be represented by different alias sets at different points in the code. Figure 4.3 shows an example with objects and their ids in square boxes, labelled with the set of variables referring to them. Two snapshots of the heap are represented by a single shape graph, where node n x represents objects 10 and 43 because they are referred to by x, and node nyz represents 12 and 51. x 10 f y, z 1 2 3 4 5 6 7 8 x = y = t = u = z = x.f y.g z.h new new new new y = y = t = u M m M M 12 t 15 x g h u 17 y, z concrete graph after run 1 43 74 f g 51 h 82 t u concrete graph after run 2 abstract graph nx f nyz nt g h nu Figure 4.3: Concrete heap graphs after line 8, and their abstract version However, in the presence of an if/else block, different runs of the program may take one or the other distinct path, resulting in different concrete graphs depending on the path taken. We make the same choice as Sagiv et al. of merging all abstract graphs at control flow merge points. For example, Fig. 4.4 shows the merged shape graphs after the if/else block. Note that in the final configuration (after line 11), none of the abstract nodes’ labels contain d , as that variable is null. Note also the elimination of the b . g edge. 4.3.3 States + Shape graph = Shape-state graph In this section we assign a typestate to each node of a shape graph, forming a shape-state graph. We present an abstraction function β that translates a concrete runtime state δ to an abstract shape-state CHAPTER 4. TYPE SYSTEM 59 nbd f 1 2 3 4 5 6 7 na a = new M; b = new M; c = new M if ... a.f = b d = b else a.f = c d = c f [after line 9] 10 11 e = d.g d = null b.g = null ne nb f na g ncd 8 9 g f [after line 11] g ne nc Figure 4.4: Shape graphs at two program points. Note that the shape graph after line 9 is a merger of the graphs from each branch of the if/else block. graph ∆. We stress our focus on modular and local reasoning, which means the analysis is limited to the scope of a given function (intra-procedural analysis). We will also stress that the analysis is flow-sensitive, and the phrase at a given program point will be often expressed. This is similar to the flow-sensitive type qualifiers of Foster ,64 except that they use constraint solvers for alias analysis. Definition 4.2 (Concrete state). The concrete state δ is an activation frame augmented with object states: δ = 〈σ, H , ς〉. For example, the concrete state after run 1 in Fig. 4.3 is defined by: σ = {x 7→ 10, y 7→ 12, z 7→ 12, t 7→ 15, u 7→ 17} H = {〈10, f , 12〉, 〈12, g , 15〉, 〈12, h, 17〉} ς = {10 7→ f, 12 7→ w, 15 7→ w, 17 7→ w} Definition 4.3 (Shape-state graph). Define the shape state graph ∆ at a given program point as follows: ∆ ∈ SG = nT ∈ Node = e ∈ Edge = Node × Field × Node An abstract edge 〈nS , f , nT 〉 ∈ Edge exists if a run-time object represented by nS is connected via field f to another object represented by nT . Σ ∈ State = Node 7→ {r, w, f, null, >} The state of an abstract node. Node × Edge × State Set of abstract nodes and edges and node state map. A graph node with the id nT represents all run-time objects referred to by the variable alias set, or label, T at a given program point. The node n; denotes all objects in the heap that are not directly referred to by any variable. 60 4.3. TYPESTATES AND HEAP ABSTRACTION For example, the shape-state graph of Fig. 4.3 is defined by ∆ = 〈Ns, Es, Σ〉, where Ns = {n x , nyz , n t , n u } Es = {〈n x , f , nyz 〉, 〈nyz , g , n t 〉, 〈nyz , h, n u 〉} Σ = {n x 7→ f, nyz 7→ w, n t 7→ w, n u 7→ w} Definition 4.4 (Label). Define a labeling function L as the set of variable aliases referring to a runtime object ρ as follows: L (ρ, δ) = { v ∈ Var | σ(v) = ρ, where δ = 〈σ, ∗, ∗〉 }. For brevity and ease of presentation we will use the forms L (ρ) to implicitly denote L (ρ, δ) and L 0 (ρ) for L (ρ, δ 0 ). Definition 4.5 (Abstraction Function β). Define a family of functions β that transforms componentwise the concrete state δ = 〈σ, H , ς〉 to a shape state graph ∆ = 〈n, e, Σ〉. Please note that for ease of presentation, we use the symbols σ, ρ etc. in the implicit context of δ. ∆ = β(〈σ, H , ς〉) ¬ 〈β(σ), β(H ), β(ς)〉 Abstraction of concrete state. n = β(σ) ¬ { β(ρ) | σ(v) = ρ, v ∈ dom(σ) } Set of abstract nodes. e = β(H ) ¬ { β( (ρ, f , ρ1 ) ) | H (ρ)( f ) = ρ1 } Set of abstract edges. n = β(ρ) ¬ nL (ρ) Abstract node. e = β(〈ρ, f , ρ1 )〉 ¬ 〈β(ρ), f , β(ρ1 )〉 Abstract edge if 〈ρ, f , ρ1 〉 ∈ H . Σ = β(ς) ¬ λnT . An abstract node’s state represents the least upper bound of the concrete states of its constituent objects. F ς(ρ) T =L (ρ) The partial order for shape-state graphs ∆1 = 〈Ns1 , Es1 , Σ1 〉 and ∆2 = 〈Ns2 , Es2 , Σ2 〉 is defined as follows: ∆1 v ∆2 ¬ (Ns1 ⊆ Ns2 ) ∧ (Es1 ⊆ Es2 ) ∧ (Σ1 v Σ2 ) where (Σ1 v Σ2 ) ¬ Σ1 (n) v Σ2 (n) for all n ∈ Ns1 We say ∆2 is more conservative than ∆1 if ∆1 v ∆2 . The graph merge operation of Fig. 4.4 is defined as follows: ∆1 t ∆2 ¬ (Ns1 ∪ Ns2 , Es1 ∪ Es2 , Σ0 )  Σ (n) if n ∈ / dom(Σ2 )  1 0 where Σ = λn. Σ2 (n) if n ∈ / dom(Σ1 )  Σ1 (n) t Σ2 (n) otherwise The abstract interpretation process introduces two levels of imprecision: (i) a node’s state is the least upper bound of the states of the objects it represents (Def. 4.5), and (ii) the merging of shape graphs at CFG merge points as defined above. In spite of this imprecision, the shape-state graph contains a wealth of information, which we examine next. CHAPTER 4. TYPE SYSTEM 4.3.4 61 Properties of Shape-State graphs In this section we study the inverse of abstraction: concretization. Given a shape-state graph, we examine what can be definitely asserted about all the concrete states it represents. This idea is important because during static analysis, we have necessarily to work only with abstract graphs and draw conclusions from them. Compatibility Observe that in any given run of the program, and at a particular program point, if x and y are aliases (refer to the same object), and x and v are aliases, it necessarily means that all three are aliases; that is {x, y} and {x, v} are not distinct alias sets. It follows that the corresponding abstract graph cannot have two nodes with non-disjoint labels (nxy and nxv ). Formally, β(ρ) 6= β(ρ0 ) =⇒ ρ 6= ρ0 However, since the shape analysis algorithm merges multiple shape graphs (Fig. 4.4), the resulting graph may have nodes whose labels are not disjoint. This is not a problem; it merely represents a disjunction of possibilities (“either x is aliased with y or x is aliased with u, but never at the same time”). This property will be used to tease apart some of these disjunctions using the notion of node compatibility. Definition 4.6 (Compatibility). Two nodes are defined as compatible only if they can possibly represent objects from a single concrete state compatible(nA, nB ) ¬ A = ; ∨ B = ; ∨ (A 6= B ∧ A ∩ B = ;) Clearly, there can never be an edge between two incompatible nodes, even with the merging process. This is called the edge compatibility constraint.154 The rules described in the next section maintain this invariant. Reachability We define nA ∆ nB in the same manner as the concrete reachability relation of Defn. 3.1 (p. 45), to mean a path between two nodes in a shape graph. A concrete edge has a corresponding abstract edge in the shape state graph. That is, H (ρ1 )( f ) = ρ2 =⇒ ρ1 H ρ2 =⇒ 〈β(ρ1 ), f , β(ρ2)〉 ∈ ∆ (by construction) =⇒ β(ρ1 ) ∆ β(ρ2 ) Since merging of abstract graphs does not lose information (edge sets are simply merged), a path between concrete objects definitely has a counterpart between the corresponding abstract nodes (transitive argument). Conversely: 〈β(ρ1 ), f , β(ρ2)〉 ∈ /∆ =⇒ ¬(H (ρ1 )( f ) = ρ2 )) This means that, if there is no path between two shape nodes, we can assert that there exists no path between two corresponding run-time objects. In contrast, if there is a path between two shape nodes, it may or may not correspond to an equivalent run-time path. At the very least, all the nodes on the path must be pairwise compatible for there to be an equivalent concrete path (if at all). We use this property to define a shape cluster. 62 4.3. TYPESTATES AND HEAP ABSTRACTION x x y f h g k (a) x f y f y h y g g k (b) (c) Figure 4.5: Materialization: Extracting the target objects of expression x. f Definition 4.7 (Shape Cluster). We define a shape cluster as all the compatible shape nodes reachable from a given node. scluster(n : Node, ∆) ¬ Smallest set X such that n ∈ X ∧ (n 0 ∈ X =⇒ n ∆ n 0 ∧ compatible(n, n 0 )) Note that unlike cluster (Fig.3.3), the shape cluster is not a reflexive, transitive closure of the reachability relation; it is a subset. What is important is that cluster(ρ, H ) is represented completely by scluster(β(ρ), ∆). The message verification rules described in §4.4.2 use this property. When an object is supplied as an argument to a whole parameter (in a method invocation), we approximate all possible reachable concrete objects by the corresponding reachable cluster in the shape graph. All such nodes are considered suspect, in that they may refer to exported object references after the call. Materialization and sharing Figure 4.5a shows an abstract graph, and subfigure 4.5b gives an idea of the concrete possibilities it represents (and what it doesn’t), across many runs of the program. A few of the concrete nodes have incoming and outgoing edges, some have only one or the other, and others have no edges. The only factor common to parallel concrete edges (the x. f edge, for example) is that their source and target endpoints have the same label. This leads us to two important observations. First, the presence of multiple abstract edges incident on an abstract node ( f and g in Fig. 4.5a) do not necessarily mean that a constituent concrete node is actually heap aliased; conventional shape analyses maintain a separate is-shared predicate to know whether there are two or more concrete incoming edges. In our case, the static semantics consult the state map Σ to prevent aliasing (see S-PUTFIELD in §4.4.2). This allows us to make the definite claim that multiple edges incident on an abstract node merely represent the overlaying of well-formed concrete trees. Second, during the abstract interpretation process, if we encounter a statement of the form y = x. f , the figure should clarify that although the abstract edge x. f is incident on the node y, it targets only a fraction of y’s concrete nodes (shown as y • ). Because we ensure that a concrete object can never be shared, the fraction of nodes y • targeted by x. f is disjoint from the rest, y ◦ . Later, we will find it necessary to “scoop out” this fraction, or materialize as it is known in the shape analysis literature.154 Fig. 4.5c shows the effect after materializing the target of x. f . Note that there are no incoming g -labelled edges incident on y • . However, we can never know how the outgoing concrete edges are split up. For this reason, we have to be conservative and add abstract outgoing edges (h and k) from both y • and y ◦ . CHAPTER 4. TYPE SYSTEM xy 63 uv xy v kill(u) u ϕ ϕ Figure 4.6: Kill State of a variable A node defines a set of variable aliases. Suppose the states of two nodes are given as follows: Σ(nuv ) = f and Σ(nvz ) = r. This indicates that using the variable u will always yield a free object (unowned, writable), and using z will always be a safe object. However, the variable v will in some cases refer to free objects, and sometimes to safe objects. The only conservative assumption about the state of the object obtained via v is to take the least upper bound of the states. We overload the function Σ as follows: F Σ(v : Var) ¬ ς(n) { n∈∆ | v∈vars(n) } where vars(nT : Node) ¬ T Kill We can scavenge a dead variable from an activation frame without changing the behavior of the program. We define a kill pseudo-instruction for both the concrete and shape states. The concrete state merely removes the variable from its list of local variables; the corresponding change to the shape state graph may be larger because some nodes merge with n; (Fig.4.6) Definition 4.8 (Kill). Given δ = 〈σ, H , ς〉 and ∆ = 〈Ns, Es, Σ〉, define the overloaded functions kill as follows. We use the notation nX −v ¬ nX \{ v } , to strip a variable off a node’s label X . 〈σ 0 , H , ς〉 where ( undef if v = v 0 0 λv . σ(v) otherwise kill(δ, v : Var) ¬ σ0 = kill(∆, v : Var) ¬ 〈Ns0 , Es0 , Σ0 〉 where Ns0 = { nX −v | nX ∈ Ns } Es 0 = { 〈nS−v , f , nT −v 〉 | 〈nS , f , nT 〉 ∈ Es } Σ 0 = λnX . Σ(nX ) t Σ(nX ∪{ v } ) We use kill in our typestate rules not just to scavenge, but also to invalidate certain variables after a method call. If a program survives such invalidation, we know that the variables are dead by fiat, and can thus be scavenged. In the limit case, one could invalidate all variables and be left only with n; ; the post-call abstract state would be accurate (no objects are labelled), but possibly completely useless as there are no valid variables left. The rules in the next section strive to keep the heap model accurate without losing precision unduly. 64 4.4. PHASE-2: MESSAGE VERIFICATION 4.4 Phase-2: Message verification We now have all the mechanisms required to discuss the message verification process. The semantics rules, enumerated next, are of the form Γ; ∆ ` stmt ⇒ Γ; ∆0 . Each rule checks preconditions that conservatively cover the preconditions outlined in §4.3.1, and acts as a transfer function to produce the new shape state ∆0 . The type environment from phase-1 remains constant in this phase. As in phase-1, we perform an abstract interpretation on the control flow graph of each function, using the lattice 〈∆, v〉 and the merging operator defined in §4.3.3. 4.4.1 Normalization For ease of presentation we simplify the rules by factoring out some aspects common to all instructions. This is done using a normalization pass on the code. We rewrite each statement that uses the same variable on both the left and right-hand sides to ones using fresh temporary variables. For example x = x. f becomes t = x. f ; x = t . Next, each non-null assignment instruction of the form x = . . . is preceded with kill(x) to separate the “kill” step from the “gen” step. By the same token, all statements of the form x. f = y are assumed to be preceded with x. f = null. 4.4.2 Semantic Rules Notation. We define ∆ = 〈Ns, Es, Σ〉 and ∆0 = 〈Ns0 , Es0 , Σ0 〉 in all the rules below, and use ∗ as a wildcard while selecting edges. We use the following auxiliary function in the rules. nodes(v : Var) ¬ { n ∈ Ns | v ∈ vars(n) } New. The S-NEW rule adds a fresh node n x to the graph. No new edges are added. S-NEW Ns0 = 0 Es = Es Σ0 = Σ[n x 7→ f] Ns ∪ { n x }, n x is fresh Γ; ∆ ` x = new M ⇒ Γ; ∆0 Assign. This rule merely relabels the nodes; the auxiliary function a x adds x to a node’s label if it already has y. (F ranges over field names). The new graph is isomorphic to the old one, modulo x. ` Σ(y) < > ( let a x ¬ λnZ . S-ASSIGN nZ∪{ x } if y ∈ Z nZ otherwise Ns0 = { a x (nZ ) | nZ ∈ Ns } Es0 = { 〈a x (nS ), F , a x (nT )〉 | 〈nS , F , nT 〉 ∈ Es } Σ0 = λnT . Σ(nT \{ x } ) Γ; ∆ ` x = y ⇒ Γ; ∆0 NullVar. Recall that x has already been killed by normalization (kill). We introduce a dummy node n x so that the shape state records the fact that x has been initialized to null. n x does not correspond CHAPTER 4. TYPE SYSTEM 65 to any object in the heap. Note that the shape graph can be widened to introduce new nodes that do not correspond to any objects in the heap, without affecting the fact that every concrete object has a counterpart in the shape state graph. S-NULLVAR Ns0 = Ns ∪ { n x } Es0 = Es Σ = Σ[n x 7→ null] 0 Γ; ∆ ` x = null ⇒ Γ; ∆0 NullField. As long as x is writable and non-null, we can remove all edges of the form x. f . This strong nullification property is one of the hallmarks of the shape analysis framework. The original target nodes of x. f become free if (and only if) they are no longer reachable in the new configuration. ` f ≤ Σ(x) ≤ w Ns0 = Ns 0 Es = Σ0 = Es \ { 〈nX , f , ∗〉 ∈ Es | x ∈ X ) } ( f if Σ(nZ ) = w ∧ 〈nX , f , nZ 〉 ∈ Es ∧ 〈∗, ∗, nZ 〉 ∈ / Es0 where x ∈ X λnZ . Σ(nZ ) otherwise S-NULLFIELD Γ; ∆ ` x. f = null ⇒ Γ; ∆0 PutField. The precondition is that all x-nodes (X ) are writable and all y-nodes (Y ) are free. The precondition that the nodes of x and y be disjoint ensures that a cycle is not created. For example, x. f = y would create a cycle if x and y were aliases. The rule connects all X to all Y with an f -edge (strong update), then downgrades the y-nodes to writable. This operation has no effect if Y is empty (meaning x. f is null along all CFG paths); recall that normalization has already removed all x. f edges prior to the evaluation of this rule. ` Σ(x) ≤ w, Σ(y) ≤ f, x 6= y ` nodes(x) ∩ nodes(y) = ; S-PUTFIELD Ns0 = Ns Es0 = Σ0 = Es ∪ { 〈nX , f , nY 〉 | Σ(y) = f } where x ∈ X , y ∈ Y ( w if 〈nX , f , nZ 〉 ∈ Es0 where x ∈ X λnZ . Σ(nZ ) otherwise Γ; ∆ ` x. f = y ⇒ Γ; ∆0 GetField. Informally, for a statement of the form x = y. f , we identify all the y. f edges in the graph and add x to the label of the target nodes of these edges. But due to materialization (Fig. 4.5), we can be more precise, albeit with a concomitant increase in the complexity of the rule. In the example figure below, the S-GETFIELD rule transforms the shape state graph on the left to the one on the right. 66 4.4. PHASE-2: MESSAGE VERIFICATION nb g nya f f nyd f nb f g f nc S-GetField nya f ncx h f h ne nyd f nc nex Denote the source and target nodes of these edges by Ys and Ts respectively. In the example, Ys = { nya , ny d }, and Ts = { nc , ne }. We identify a subset of the target nodes YFs as those that only have incoming y. f edges and no other; in the example, YFs = { ne }. These can be simply relabeled, as shown by the ne x node on the right, because the runtime objects represented by these shape state nodes are all exclusively targets of y. f nodes. Each node in the rest of the target nodes (that is, in Ts\YFs = { nc }) represents (i) runtime objects that are targets of y. f , and (ii) unrelated runtime objects that are targets of other unrelated edges. These two sets represent disjoint sets of runtime objects, given the invariant that no message object may be pointed to by more than one object at run time. After the transformation, the first part is materialized out as a separate node nc x , while the latter continues to be nc . ` f ≤ Σ(y) ≤ r let Ys = { nY | 〈nY , f , ∗〉 ∈ Es, nY ∈ Y } let Ts = { nT | 〈nY , f , nT 〉 ∈ Es, nY ∈ Y } let YFs = Ns0 = {nT | @nZ . (〈nZ , F , nT 〉 ∈ Es ∧ nT ∈ Ts ∧ (y ∈ / Z ∨ F 6= f ))} Ns \ YFs ∪ { nT ∪{ x } | nT ∈ Ts } Es0 = Es \ { 〈nY , f , nT 〉 | nY ∈ Ys, nT ∈ Ts }) 1 \ { 〈nT , ∗, ∗〉 | nT ∈ YFs } 2 ∪ { 〈nY , f , nT ∪{ x } 〉 | nY ∈ Ys, nT ∈ Ts } 3 ∪ { 〈nT ∪{ x } , F , nU 〉 | 〈nT , F , nU 〉 ∈ Es } 4 Σ0 S-GETFIELD =  F w t Σ(nY ) if 〈nY , f , nZ 〉 ∈ Es0 5   ny  λnZ . null if Z = { x } ∧ Ys = ; 6   Σ(nZ\{ x } ) if nZ\{ x } ∈ / YFs 7 for all nY , nT ∈ Ns, y ∈ Y Γ; ∆ ` x = y. f ⇒ Γ; ∆0 = (Ns0 , Es0 , Σ0 ) For edges, we remove all y. f edges incoming to Ts (1) and retarget them to the updated targets (3). We remove all edges outgoing from YFs (2) because these will not have a counterpart in the new shape CHAPTER 4. TYPE SYSTEM 67 state. Finally, we add edges from the updated Ts (4). Note that nya and ny d point to the new node nc x , while the n b to nc edge is preserved. As for the typestate, nodes in the updated Ts get their state from their parent Ys in the new configuration (5), or null if there were no y. f edges at all (6). All others remain the same (7). Method Call. The method call ensures that non-safe arguments are disjoint from every other argument, then invalidates (kills) all variables that could potentially be pointing to an object that may have been exported by the method call. All linear type systems have variants of this style of consumption. let sig(m) = qτ → τ r let vcluster(v : Var) = S n∈nodes(v) scluster(n) let disjoint(v, v 0 ) = vcluster(v) ∩ nodes(v 0 ) = ; where v, v 0 ∈ Var ( S vcluster(yi ) if qi = whole let Nk = i vcluster(yi ) \ nodes(yi ) if qi = part S let Vk = n∈N vars(n) k ` Σ(yi ) ≤ qi ` (qi = q j = safe) ∨ disjoint(yi , y j ), for all i 6= j ∆0 = fold(kill,Vk , ∆) Ns00 = Ns0 ∪ { n{ v } | v ∈ Vk } Es00 = Σ00 = Es0 ∪ { 〈n{ x } , f , n; 〉 | f ∈ fields(τ r ) } if τ r is a message type  > if T ∩ Vk 6= ;  λnT . f if T = { x } and τ r is a message type  Σ0 (nT ) otherwise ∪ { n x }, provided τ r is a message type S-CALL Γ; ∆ ` x = m(y) ⇒ Γ; ∆00 The function vcluster is a union of all clusters accessible from each of nodes(v) and when adjusted for the qualifier (Nk ), corresponds to concrete objects that are reachable from v that are permitted to be exported or grafted onto other objects. We have no way of knowing how the callee will change these objects. The variables in the set Vk belong to the labels of nodes in Nk , and which may be left with dangling references after the call. For this reason, we conservatively invalidate them all using the functional fold function to kill each variable successively. Consider the example below. 1 2 int h(safe if ... y = a 3 x = new M y = new M x.f = y 5 6 7 nx ny send(x) nodes(y) = { nay , ny }, scluster(n x ) = { n x , ny } //Consume vcluster(x) = scluster(n x ) //After call; Σ(n x ) = Σ(ny ) = > and Σ(na ) = r print(a) // ok. //Before call: 11 12 Σ(n x ) = f } 9 10 Σ(nay ) = r else 4 8 a) { 68 4.4. PHASE-2: MESSAGE VERIFICATION Disjointness of a non-safe argument from all other arguments is verified by ensuring that its vcluster does not contain the nodes of the other arguments. The callee can thus be assured that if any of its input arguments is writable, it is independent of any other argument. Using kill serves another important purpose, besides invalidation. It eliminates uncertainty about the post-call state by retaining only those variables and nodes and edges that are guaranteed to remain untouched; these include safe arguments and those inaccessible to the called method. This way, we ensure that both the concrete and abstract models have been brought into sync. Note that we introduce dummy nodes corresponding to all the invalidated variables to record the fact that they were explicitly invalidated (> indicates the status is unknown). This is a technical device which ensures that, after a graph merge (at CFG merge points), we can tell the difference between a variable that was invalidated on one path and not on the other, as shown below. 1 2 3 4 5 if ... { a = y send(y) // Σ(ny ) = Σ(na ) = > } else { b = y = new M // 7 } // Merge point 8 // Graph contains 6 Σ(nby ) = f Σ(ny ) = Σ(na ) = >, Σ(nby ) = f In this example, it is safe to use b (at line 6) because it only refers to nby , whose typestate is valid. But it is not safe to use y because its state is Σ(ny ) t Σ(nby ) = >. Finally, we create edges between the return variable x and the summary node n; to record the fact that we have a free object whose subtree is unlabelled. Method Entry If the current method is called m, we introduce a shape node for each message argument pi , a dummy variable and node p̂i to represent the parent node, if the argument is part (to keep track of ownership), and edges from these dummy nodes and to the summary node n; , to represent potential concrete edges. let sig(m) = q τ → ∗ let p = args(m) S-ENTRY size Note: i ranges from 1 to |args(i )| S Ns = n , where qi 6= none Si { pi } ∪ i n{ p̂i } , where qi = part and p̂i is fresh S Es = 〈n , f , n 〉 for all f ∈ fields(τi ) Si { pi } ˆ ; ∪ i 〈n{ p̂i } , f , n pi 〉 where qi = part, and fˆ is a fresh field    r if qi = safe      Σ = n{ pi } 7→ w if qi = part     f if qi = whole ∆ = (Ns, Es, Σ) Put, Get The rules presented so far permit us to treat transfer whole messages: put and get as ordinary function calls that CHAPTER 4. TYPE SYSTEM sig(put) = (M mbx, whole sig(get) = (M mbx) → 〈whole 69 M) → ∗ M〉 In practice, this permits us to implement communication primitives as library methods and not give them special treatment in the type checker. This way, we can implement other communication primitives such as futures and Haskell style MVars. We have also investigated using this type system for unique ownership and linear transfer of ownership of security capabilities.59 Note that spawn and select do not have message type parameters, and so do not need to be dealt with in this phase. 4.5 Soundness of message verification This section shows that the message verification rules (§4.4) conservatively simulate the operational semantics of §3.2, or more precisely, the operational semantics augmented with a run-time type system ( §4.3.1). Any program that passes the static test will never encounter dangling references and will always maintain the tree structure invariant of messages. We will use the notation ¹. . .ºδ and ¹. . .º∆ to represent the operational and static rules respectively. Proposition 4.2 (Monotonocity). For all semantic rules of the form ¹. . .º∆ and for all pairs of shape states ∆1 and ∆2 such that ∆1 v ∆2 , we have ¹·º∆ (∆1 ) v ¹. . .º∆ (∆2 ) The proof structure is aligned closely to the excellent paper by Sagiv et al.154 Our proof is necessarily simpler because objects are not heap-aliasable; on the other hand, we add function invocation and consumption policies and treat null specially (which permits us to detect null-dereferences in some cases). Informally, the proof structure is as follows. Any run-time heap graph has a corresponding shape state graph. We prove that for a verified program, the transformation of the run-time graph due to any instruction (given the rules in Fig. 3.4 and 3.5, p. 46) is paralleled in a conservative fashion by the transformation of the corresponding shape state graph (the rules in §4.4.2). Figure 4.7 shows a concrete pre-state δ = 〈σ, H , ς〉 transformed to a concrete post-state δ 0 = 〈σ 0 , H 0 , ς 0 〉. Likewise, the static verification rules transform the abstraction of the concrete-pre-state, the shape state graph ∆, to ∆0 . To assert that ∆0 soundly models δ 0 , we compare it to the shape-state ˆ (= β(δ 0 )), and verify that for each rule, ∆ ˆ ⊆ ∆0 . graph of the post-state, ∆ Theorem 4.1 (Local Safety Theorem). For all statements stmt and for every concrete evaluation ¹stmtºδ (〈σ, H , ς〉) 〈σ 0 , H 0 , ς 0 〉 β(〈σ 0 , H 0 , ς 0 〉) v ¹stmtº∆ (β(〈σ, H , ς〉)) PROOF . ˆ v ∆0 , we need to show that each component is a subset of the other’s. To prove ∆ n ∈ N̂s =⇒ n ∈ Ns0 e ∈ Ês =⇒ e ∈ Es0 Σ̂(n) v Σ0 (n), for all n ∈ N̂s 70 4.5. SOUNDNESS OF MESSAGE VERIFICATION βδ δ = !σ, H, ς" !stmt"δ δ ! = !σ ! , H ! , ς ! " ∆ = !Ns, Es, Σ" to prove βδ! ˆ = !Ns, ˆ Es, ˆ Σ̂" ∆ ! !stmt"∆ ∆! = !Ns ! , Es ! , Σ! " Figure 4.7: Parallel execution of concrete and abstract semantics ˆ to indicate the direction of the proof, with We will use phrases such as “going from δ → δ 0 → ∆” reference to Fig. 4.7. Case 1 β(¹killºδ (δ)) v ¹killº∆ (β(δ)), where δ = 〈σ, H , ς〉 Let the variable to be killed be x. Let σ(x) = ρ. The object ρ is the only one to have its label modified in the concrete semantics;the heap remains the same. Define X = L (ρ) and X0 = X \{x } I. (Proving N̂s ⊆ Ns0 ). ˆ Going from δ → δ 0 → ∆ σ(x) = ρ =⇒ L 0 (ρ) = X \ { x }, S =⇒ N̂s = { nX 0 } ∪ ρ0 6=ρ nL (ρ0 ) , Similarly, going from δ → ∆ → ∆0 S Ns = { nX } ∪ ρ6=ρ nL (ρ) , S =⇒ Ns0 = { nX0 } ∪ ρ6=ρ nL (ρ) . from Def. 4.8 ˆ from δ 0 → ∆ from δ → ∆ x removed, but L (ρ) unchanged. Hence, N̂s ⊆ Ns0 II. (Proving Ês ⊆ Es0 ). Similar to above; the end point labels have changed, but the edges are unchanged III. (Proving Σ̂ ⊆ Σ0 ). Let ς x = ς(ρ) = ς 0 (ρ), since the typestate is unchanged. There are three subcases: (a) For all objects ρ0 6= ρ, the label is unchanged; if Z = L (ρ0 ) = L 0 (ρ0 ), it is easy to see that Σ̂(nZ ) = Σ(nZ ) = Σ0 (nX ) (b) Label changed for ρ, but X 0 6= ;. Since a variable can only point to a single object, and since X 0 is non-empty, nX 0 must represent only ρ. Σ̂(nX 0 ) = ς 0 (ρ) = ς(ρ) = Σ(nX ) = Σ(nX 0 ) (c) Label changed for ρ, and X 0 = ;. Let ς; be the least upper bound of all typestates of all objects without a label. Since X 0 = ; , ρ is represented by n; , and hence influences n; typestate. Going from δ → δ 0 → Σ̂, CHAPTER 4. TYPE SYSTEM 71 ˆ from δ 0 → ∆ Σ̂(n; ) = ς 0 (ρ) t ς;0 Σ̂(n; ) = ς(ρ) t ς; Going from δ → ∆ → ∆0 , Σ(nX ) = ς(ρ), and Σ(n; ) = ς; =⇒ Σ0 (n; ) = Σ(n;∪{ x } ) t Σ(n; ) =⇒ Σ0 (n; ) = Σ(nX ) t Σ(n; ) =⇒ Σ0 (n; ) = ς(ρ) t ς; because neither typestate has changed from ¹killº∆ Hence, in all three cases Σ̂(n; ) ≤ Σ0 (n; ) Case 2 β(¹x. f = nullºδ (δ)) v ¹x. f = nullº∆ (β(δ)), where δ = 〈σ, H , ς〉 I. (Proving N̂s ⊆ Ns0 ). Follows directly from the fact that σ is unmodified in ¹x. f = nullºδ II. (Proving Ês ⊆ Es0 ). Going from δ 0 → δ → ∆ → ∆0 Let edge 〈ρ1 , g , ρ2 〉 ∈ H 0 =⇒ 〈ρ1 , g , ρ2 〉 ∈ H ∧ ((σ(x) 6= ρ1 ) ∨ ( f 6= g )) =⇒ 〈nL (ρ1 ) , g , nL (ρ2 ) 〉 ∈ Es ∧ ((x ∈ / L (ρ1 )) ∨ ( f 6= g )) =⇒ =⇒ 〈nL (ρ1 ) , g , nL (ρ2 ) 〉 ∈ Es0 from ¹x. f = nullºδ from S-PUTNULL 0 〈nL 0 (ρ1 ) , g , nL 0 (ρ2 ) 〉 ∈ Es because σ 0 = σ. ˆ Going from δ 0 → ∆ 〈ρ1 , g , ρ2 〉 ∈ H 0 =⇒ 〈nL 0 (ρ1 ) , g , nL 0 (ρ2 ) 〉 ∈ Ês ˆ =⇒ e ∈ ∆0 Hence, e ∈ ∆ III. (Proving Σ̂ ⊆ Σ0 ). Let ρ1 = σ(x) and 〈ρ1 , f , ρ2 〉 ∈ H . We will examine ρ2 separate from all the others (because only its typestate is modified). This is the only rule where a node’s typestate is upgraded to free. ς(ρ2 ) = w, (because the preconditions ensure that its parent ρ1 is not marked safe). ς 0 (ρ2 ) = f, (from ¹x. f = nullºδ , ρ2 is no longer owned, and was never marked safe) let Σ(nZ ) = ς(ρ2 ) t ςrest , where nZ node represents ρ2 , and other objects whose collective typestate (least upper bound) is denoted by ςrest (which remains unchanged). (a) If ςrest is r, at least one of the other objects is marked safe: =⇒ Σ0 (nZ ) = Σ(nZ ) = r from ¹x. f = nullº∆ (upgrade to f not possible) ˆ Going from δ → δ 0 → ∆, Σ̂(nZ ) = ς 0 (ρ2 ) t ςrest Σ̂(nZ ) = r (b) If ςrest is w, there must be at least one non x. f edge in H , which means there must be an equivalent edge in Es: =⇒ Σ0 (nZ ) = Σ(nZ ) = w from ¹x. f = nullº∆ . (upgrade to f not possible) 0 ˆ Going from δ → δ → ∆, Σ̂(nZ ) = ς 0 (ρ2 ) t ςrest Σ̂(nZ ) = w 72 4.5. SOUNDNESS OF MESSAGE VERIFICATION (c) If ςrest is f, there are no incoming edges (ρ2 was holding the others back, so to speak). If nZ has incoming edges (being conservative), we cannot upgrade to f. If not, it satisifies all the conditions of ¹x. f = nullº∆ : Σ0 (nZ ) = f from ¹x. f = nullº∆ 0 ˆ Going from δ → δ → ∆, Σ̂(nZ ) = ς 0 (ρ2 ) t ςrest Σ̂(nZ ) = f (d) For all other objects ρ, there is no change in ς(ρ). If ρ belongs to nZ as defined above (it belongs to rest), the sub-cases above show that Σ̂(nZ ) ≤ Σ0 (nZ ) If not, ς(ρ) ≤ (Σ̂(nL (ρ) ) = Σ(nL (ρ) ) = Σ0 (nL 0 (ρ) )) In all cases we have shown that Σ̂(n) ≤ Σ0 (n) Case 3 β(¹x. f = yºδ (δ)) v ¹x. f = yº∆ (β(δ)), where δ = 〈σ, H , ς〉 I. (Proving N̂s ⊆ Ns0 ). Follows directly from σ 0 = σ, in ¹x. f = yºδ II. (Proving Ês ⊆ Es0 ). The only concrete change is the introduction of a new x. f edge (a) Concrete edge was newly introduced Going from δ 0 → δ → ∆ → ∆0 Let 〈ρ1 , f , ρ2 〉 ∈ H 0 where σ 0 (x) = ρ1 ∧ σ 0 (y) = ρ2 =⇒ σ(x) = ρ1 ∧ σ(y) = ρ2 ∧ ς(ρ2 ) = f =⇒ nL (ρ1 ) , nL (ρ2 ) ∈ Ns =⇒ 〈nL 0 (ρ1 ) , f , nL 0 (ρ2 ) 〉 ∈ Es0 ˆ Going from δ 0 → ∆ from ¹x. f = yºδ δ →∆ from ¹x. f = yº∆ 〈ρ1 , f , ρ2 〉 ∈ H 0 =⇒ 〈nL 0 (ρ1 ) , f , nL 0 (ρ2 ) 〉 ∈ Ês (b) All other edges. By inspection, all unrelated edges are preserved and no new ones are created. Therefore, e ∈ Ês =⇒ e ∈ Es0 III. (Proving Σ̂ ⊆ Σ0 ). (a) Newly introduced edge 〈ρ1 , f , ρ2 〉 ∈ H 0 where σ 0 (x) = ρ1 ∧ σ 0 (y) = ρ2 Going from δ 0 → δ → ∆ → ∆0 ς 0 (ρ2 ) = w, owned by ρ1 by typestate definition =⇒ ς(ρ2 ) = f, precondition of ¹x. f = yºδ =⇒ Σ(nL (ρ2 ) = f δ →∆ 0 =⇒ Σ (nL 0 (ρ2 ) = w from ¹x. f = yº∆ 0 ˆ Going from δ → ∆ ς 0 (ρ2 ) = w =⇒ Σ0 (〈L 0 (ρ2 )〉) (b) All other edges. By inspection, unrelated edges are preserved and no new ones are created, and no other objects’ typestates are updated. Hence, Σ̂(n) ⊆ Σ0 (n) for all n ∈ Ns0 (Note: Ns0 = Ns) CHAPTER 4. TYPE SYSTEM Case 4 73 β(¹x = y. f ºδ (δ)) v ¹x = y. f º∆ (β(δ)), where δ = 〈σ, H , ς〉 I. (Proving N̂s ⊆ Ns0 ). Assume that the y. f edge refers to the concrete edge 〈ρ1 , f , ρ2 〉 ∈ H . We’ll refer to ρ2 as the target object. The only change in the concrete state is the update to ρ2 ’s label. L 0 (ρ2 ) = L (ρ2 ) ∪ { x } (a) Consider object ρ2 . Going from δ → ∆ → ∆0 x∈ / L (ρ2 ) =⇒ nL (ρ2 ) ∈ Ns =⇒ nL (ρ2 )∪{ x } ∈ Ns0 =⇒ nL 0 (ρ2 ) ∈ Ns0 ˆ Going from δ → δ 0 → ∆ nL 0 (ρ due to normalization δ →∆ from [3] in ¹x = y. f º∆ 2 )∈N̂s (b) All objects other than ρ2 . Since their labels don’t change, nL (ρ) is present in Ns, Ns0 and N̂s. Therefore, for all objects, nL 0 (ρ) ∈ N̂s =⇒ nL 0 (ρ) ∈ Ns0 II. (Proving Ês ⊆ Es0 ). As before, let y. f edge refer to the concrete edge 〈ρ1 , f , ρ2 〉 ∈ H . In the concrete semantics of ¹x = y. f ºδ , there are no changes to edges (H 0 = H ). We will consider three subcases from ρ2 ’s point of view; incoming concrete edges (there can be only one, due to well-formedness assumptions), outgoing edges, and all other edges that have nothing to do with ρ2 . (a) Incoming edge; this must be the y. f edge. ˆ Going from δ → δ 0 → ∆ 〈ρ1 , f , ρ2 〉 ∈ H =⇒ 〈ρ1 , f , ρ2 〉 ∈ H 0 =⇒ 〈nL 0 (ρ1 ) , f , nL 0 (ρ2 ) 〉 ∈ Ês Going from δ → ∆ → ∆0 〈ρ1 , f , ρ2 〉 ∈ H =⇒ 〈nL (ρ1 ) , f , nL (ρ2 ) 〉 ∈ Es, =⇒ 〈nL (ρ1 ) , f , nL (ρ2 )∪{ x } 〉 ∈ Es0 0 δ → ∆, where y ∈ L (ρ1 ) from ¹x = y. f º∆ 0 Hence e = 〈nL 0 (ρ1 ) , f , nL (ρ2 )〉 ∈ Ês =⇒ e ∈ Es0 (b) Edges outgoing from ρ2 ; consider 〈ρ2 , g , ρ3 〉 for some field g and some other message object ρ3 . ˆ as above Going from δ → δ 0 → ∆, 〈nL 0 (ρ1 ), f ,L 0 (ρ2 ) 〉 ∈ Ês Going from δ → ∆ → ∆0 〈nL (ρ2 ) , g , nL (ρ3 ) 〉 ∈ Es δ →∆ 0 =⇒ 〈nL (ρ2 )∪{ x } , g , nL (ρ3 ) 〉 ∈ Es from ¹x = y. f º∆ =⇒ 〈nL 0 (ρ2 ) , g , nL 0 (ρ3 ) 〉 ∈ Es0 Hence, e = 〈nL 0 (ρ2 ) , g , nL 0 (ρ3 ) 〉 ∈ Ês =⇒ e ∈ Es0 (c) All other concrete edges that have nothing to do with ρ2 . Consider 〈ρ3 , g , ρ4 〉 ∈ H , where ρ2 ∈ / { ρ3 , ρ4 }. This includes the case of non f-labelled edges out of y (ρ3 = ρ1 ∧ f 6= g ). 74 4.6. MAPPING TO JAVA As above, 〈nL 0 (ρ3 ) , g , nL 0 (ρ4 ) 〉 ∈ Ês Going from δ → ∆ → ∆0 =⇒ 〈nL (ρ3 ) , g , nL (ρ4 ) 〉 ∈ Es =⇒ 〈nL (ρ3 ) , g , nL (ρ4 ) 〉 ∈ Es0 =⇒ 〈nL 0 (ρ3 ) , g , nL 0 (ρ4 ) 〉 ∈ Es0 no change to edge in ¹x = y. f º∆ no change in end-point labels for ρ 6= ρ2 0 Hence, e = 〈nL 0 (ρ3 ) , g , nL 0 (ρ4 ) 〉 ∈ Ês =⇒ e ∈ Es0 III. (Proving Σ̂ ⊆ Σ0 ). Let ρ1 = σ(y) and 〈ρ1 , f , ρ2 〉 ∈ H . No object typestates are modified in a getfield operation. That is, ς(ρ2 ) = ς 0 (ρ2 ). (a) Consider ρ2 . There are three sub-subcases depending on ς(ρ2 ) i. ς(ρ2 ) = r, it follows that ς(ρ1 ), due to the deep properties of safe. Going from δ → ∆ → ∆0 =⇒ Σ(nL (ρ2 ) ) = r δ →∆ 0 0 =⇒ Σ (nL (ρ2 )∪{ x } ) = w t Σ (nL (ρ1 ) ) from [5] in ¹x = y. f º∆ =⇒ Σ0 (nL 0 (ρ2 ) ) = w t r both ρ1 and ρ2 are safe 0 =⇒ Σ (nL 0 (ρ2 ) ) = r ˆ Going from δ → δ 0 → ∆, ς 0 (ρ2 ) = r, because typestate unchanged =⇒ Σ̂(nL 0 (ρ2 ) ) = r ii. If ς(ρ2 ) = w, the reasoning is similar to the one above. iii. If ρ2 = null, then there is no corresponding edge in Es. It follows that Σ0 (n x ) = null (b) For all objects other than ρ2 , the typestate is the same as before. In all cases, we see that ς(ρ) ≤ Σ̂(nL (ρ) ) ≤ Σ0 (nL (ρ) ) Case 5 β(¹x = m(y)ºδ (δ)) v ¹x = m(y)º∆ (β(δ)), where δ = 〈σ, H , ς〉 The safety properties of the S-CALL rule follow from those of kill and scluster; as long as the set of variables to be invalidated is conservative, kill takes care of both safety and accuracy of the abstract heap model; we have sacrificed precision (detailed information) of the heap model in favor of certainty, without losing accuracy. As we pointed out earlier, in the limit case, one could invalidate all variables and have the post-call state to be accurate (only n; is left), but completely useless. We have already seen that scluster offers us the necessary coverage of all the nodes that could refer to objects whose state cannot be predicted (Defn. 4.7, pg. 62). 4.6 Mapping to Java Java classes identified by the marker interface Message are treated as message types. That helps it keep internal objects separate from message objects, as long as the type checker is careful to maintain the original class even under casting. CHAPTER 4. TYPE SYSTEM 75 All the restrictions imposed upon message types have to do with enforcing ownership protocols and containing aliasing. There are a few categories of objects that do not fall into our definition of a message. 4.6.1 Sharable types Immutable value types such as String and Date ought to be freely exchangeable. However, simply being immutable is not enough, if the object is treated as a resource. For example, a security capability object may need to be strictly accounted for and passed carefully from one subsystem to another;59 such a scheme would need to deal with aliasing and ownership, even if it is implemented as an immutable Java class. As a side note, it is surprisingly difficult to glean from an examination of the bytecode whether or not a class is immutable. Most classes such as String and Date are known as immutable, but in reality are observationally immutable: not all fields are marked final or computed in the constructor. Some, such as hashcode are lazily computed and cached. Some classes such as Java’s java.net.URL permit inner classes to memoize its state. Secondly, fitting into an existing thriving ecosystem such as Java means that there will be a number of classes that we would like to exchange (Socket), but which are not Kilim message types. For these two reasons, we require that the programmer make explicit to the type checker that a type is sharable, via a marker interface called Sharable. Any class inheriting from Sharable is given a free pass by the type checker; a sharable object can be treated as a message, but is not subject to any constraints of aliasing or ownership, or the types of its fields. Objects obtained from fields of Sharable objects are treated as regular Java objects, unless they too are instances of a Sharable class. For this reason, it is best if the fields of such an object are encapsulated and made accessibly indirectly via thread-safe methods. Finally, the weaver limits static class fields to constant primitive or final Sharable objects and prevents exceptions from being message types. Clearly, the Sharable annotation is a safety loophole, but a necessary compromise. For existing classes whose sources cannot be annotated, but whose behavior is known, the programmer can supply “external annotations” to the weaver as a text file: 1 class java.lang.String implements Sharable 2 3 4 5 interface java.io.PrintStream { void println(safe Object o); } 6 7 class StringBuffer implements Message This scheme works as if the annotations were present in the original code. Clearly, it only works for non-executable annotations (type-qualifiers used for validation); Pausable cannot be used as it results in code transformations. Further, externally annotated libraries are not intended to be passed through Kilim’s weaver; the annotations serve to declare the library methods’ effects on their parameters. If a method parameter is not a message type, but is type compatible (upcast to Object, for example), then the absence of an annotation is treated as an escape into unknown territory; the weaver treats it as a compile-time error. 76 4.7. RELATED WORK 4.6.2 Arrays Arrays are another kind of object that we would like to be able to exchange. The Kilim weaver treats an array of primitive types, Message and Sharable types as a potential message. If an array is obtained from a message object, it is subject to message verification rules. Expressions of the form x[i] are simply treated as getfield or putfield expressions of a special field type [·]. To simplify the type checker, we provide a special destructive read 32 function called cut that reads a field or an array slot and nullifies it in one atomic operation, similar in sprit to Java’s post-increment operator. That is, x = cut(y. f ) x = cut(y[i]) ≡ ≡ x = y. f ; y. f = null, and x = y[i]; y[i] = null In both cases, the function yields a free object (assuming the array is an array of messages) 4.6.3 Collections and Iterators Collections and iterators have traditionally been a problem for any type system that enforces ownership and encapsulation protocols, including linear types, ownership types and universe types. One issue is that an iterator object results in multiple aliases, and is distinct from the collection itself. The other issue has to do with return values. Once a value is returned, the function exercises no control over it. This poses a problem for strict ownership transfer protocols, including ours. Having transferred an object to a collection, a get or lookup method cannot simply return a contained object without losing control. The obvious option of removing the object from the collection and returning it untethered is not practical. Our approach is to avoid iterators and return values altogether. Instead, we provide a few special collection classes for messages (hashmaps, lists and sets) that can be supplied actions; the introduction of closures to the Java source language should clean up the syntax considerably. 1 2 3 4 5 m = new MsgHashMap(); m.put(key, value); ... m.doWith(key, new Action(){ void action(@part Value v) { 6 7 8 9 10 11 12 13 4.7 } }); m.iterate(new Action() { void action(@part Entry e) { Key k = e.key; Value v = e.value; } }); Related work Pointer aliasing is widely used for efficiency. Hackett80 describes nine ways in which aliasing is used in systems software. The problems raised by aliasing93 are significant enough to have spawned a rich CHAPTER 4. TYPE SYSTEM 77 vein of research, and in this section we briefly categorize the approaches and discuss the highlights in roughly chronological order. What differentiates our work from extant literature is not so much the mechanics of the analysis as in our particular design points: to recap briefly, making messages data-oriented (unencapsulated C-like structures), requiring a tree-structure for ease of analysis and for run-time speed, and providing deep implicit capabilities for safety and ownership transfer. Much of our analysis framework can be used in other areas that treat objects as resources, areas such as secure information flow and resource accounting of security capabilities and database connections, for example. Our desire to require minimal changes to the Java toolchain resulted in the Kilim weaver, a post hoc analysis tool. Matthew Papi’s work on pluggable types for Java140 uses annotations as type qualifiers, and allows the Java compiler to be enhanced with user-defined types. This facility is expected to be a standard part of the compiler in future versions of Java.101 Unfortunately, in the absence of type inference machinery in the Java compiler, we feel this approach only serves to add to the syntactic burden. Immutability, Readonly When objects are immutable, low-level data races are non-existent (and higher-level data coordination is easily handled using a transaction or workflow coordinator). However, unless a language and environment makes it the default from the ground up (as in Clojure and Haskell), one has to address the mutable aspects of the language. Boyland33 highlights the complexities of different schemes to add immutability to object-oriented languages. Many attempts179 to fix the shallow nature of Java’s final and C++ const by providing transitive readonly semantics are flawed because they cannot easily mix immutables and mutables: an immutable container of mutable items, for example, or vice-versa. The other problem is that of observational exposure. When a variable (and not the underlying object) is typed readonly, and if it is possible to modify the object via some other variable, the original readonly variable can see the changes as they are happening. Worse, partial updates may be viewable in concurrent systems. This is the reason why we have safe as a temporary readonly qualifier, and why variables deemed to be referring to possibly inaccessible objects are rendered invalid. Pointer Analysis We originally started with static alias detection and verification in order to not require any programmer annotation. Precise alias analysis is undecidable,148 and every analysis must necessarily trade speed for precision and coverage. There is a vast amount of literature available on pointer analysis of C programs,88 that ranges from fast and imprecise analyses such as Steensgard168 and Andersen,8 to slow, but precise shape analyses.154 The solutions differ on whether they are flow-sensitive and context-sensitive, favor local or global whole-program analysis,187 offline vs. online89 and choice of implementation representation (constraint sets, BDDs187 ), etc. Many design choices are also based on the application for which they are intended, such as purity and escape analysis.45, 46, 155 The key problem with most alias detection schemes, even given enough time to do whole program analysis, is that they are necessarily limited in their precision in the presence of unrestricted aliasing, multi-threading and shared memory. For this reason, most post hoc analyses (whether online or offline) are used for optimization purposes, not for building correct, modular systems from the ground up. In our experience, even with annotations for isolation and consumption semantics, we found that many analyses based on pointsTo sets were too imprecise. As Sagiv154 et al. point out, these analyses create abstract heap nodes based on their allocation point — all objects allocated at line l are associated with an abstract node n l . This is equivalent to fixing the set of run-time objects that a variable 78 4.7. RELATED WORK may point to, which renders them unable to provide strong nullification and disjointness preservation guarantees. Shape analysis,154 on the other hand, is much more precise but can be extremely slow (of the order of tens of seconds for a single file). For this reason, we limit the problem to shape enforcement, thereby making it into a post hoc type system and verification scheme. We add consumption semantics, nullreference tracking and function calls. A less cumbersome alternative (from the programmer’s point of view) might be to extend the inference to package boundaries using a limited inter-procedural shape analysis. In future work, we intend to study the unified memory analysis scheme of Marron122 et al. , as it claims to be significantly faster than intra-procedural shape analysis. Separation logic Separation logic151 extends Hoare logic90 with logical separations of the heap (this is a direct influence on the heap separation model in the KAM). It provides spatial connectives to reason about parts of the heap in isolation, and rules to reason about merged heaps. Much of separation logic work is devoted to low level C-like languages with address arithmetic and no automatic memory management; Matthew Parkinson’s PhD thesis142 applies separation logic to provide local, modular reasoning for abstract data types and objects. As with any logic based approach (and unlike type and annotationbased approaches), separation logic provides a rich alternate language to make assertions about several variables together and about data structures at any point of the program (“x is a tree, and x y”); the flip side is that at its current level of evolution, it appears that the speed of analysis and the level of programmer involvement required are still impediments to widespread industrial adoption (esp. when the structures become more complicated than trees, and when concurrency is involved). As the limitations of post hoc alias detection and verification schemes became clearer, and as we discovered how brittle many analyses are in the face of evolving software, we shifted our approach to static and modular enforcement of aliasing and ownership behavior. Linearity and Uniqueness If an object is transferred linearly from one variable to another, aliasing is impossible, and at any point, the holder of the linear object has complete rights over it, including deallocation. Memory management — a prime mover behind much research on linearity — belongs to the general category of resource management. Approaches to linearity depend on whether they consider the object to be used once, or whether there is only a single extant reference to it at run time. Gerards’s linear logic73 and Lafont’s use of it for resource consumption and accounting109 has been used by type theorists to provide linear types.183 Linear types prevent aliasing altogether by ensuring that a reference is consumed (like C++’s auto_ptr) when it is passed as a parameter or when it occurs on the right hand side of an assignment. Because no aliases exist, a local analysis is enough to show whether the object can be garbage collected, and objects can be updated in-place .17, 18 Clearly, it would be too onerous to give this treatment to every single type, which means a distinction between linear and non-linear types. This leads to the question of whether a linear type can contain or refer to a non-linear type, and vice versa. Wadler183 provides a let! construct that permits a non-linear type to be treated as a linear-type in exchange for temporarily surrendering all aliases to the non-linear type. Paclang58 is an industrial example of using uniqueness and transfer of ownership. It is a simple imperative, concurrent language for high speed packet processing and handling, and is a simplification of Kobayashi’s quasi-linear types105 (no higher order functions). It defines a special datatype called packet CHAPTER 4. TYPE SYSTEM 79 (an array of bytes) that can be transferred between memory banks and processors. At any one time, a packet may have multiple references, but only within a single thread. The notion of treating a message as a distinct entity was a seminal influence on Kilim. Kilim’s message type is an inductive structure with named fields and separately identifiable components (with corresponding aliasing issues). Guava15 is a dialect of Java that prevents data races by separating objects into three categories: unsharable objects (cannot synchronize on them), Monitor objects that can be shared by reference, and immutable Value objects which are deep-copied on assignment. A move operator allows values to be transferred without copying. Adding uniqueness to Java-like type systems (those without inference) requires one to explicitly declare uniqueness everywhere it is used, for example, unique Msg x = y.f, In addition, older approaches required unique variables to be destructively read, where the right hand side is nullified to atomically prevent aliasing. Both of these points add to the syntactic burden; Boyland’s alias-burying32 scheme adds inference and delayed destructive reads via a post hoc static analysis that uses annotations equivalent to whole and safe on method signatures. We followed their lead in this respect. Regions From eliminating aliasing altogether, we next examine proposals to permit aliasing in a controlled fashion. The key idea is to have named regions that can group objects, and to track aliasing at the level of regions.17 In Tofte and Talpin’s type and effect system,177 each object belongs to a region. New regions can be allocated and deallocated dynamically, and region lifetimes are associated with lexical scope; for this reason, memory management is handled at the region level. Region names can be passed polymorphically to functions and since regions can be deallocated, dangling references are prevented by requiring functions to state which regions they use, and how. One of the best known examples of a region based memory management system is Cyclone,77 a dialect of C. This notion of automatic or compiler-generated names as equivalence classes of objects occurs repeatedly in various guises, such as owners in ownership types and guards in capability systems, etc. We examine a number of such proposals next. Objects and Encapsulation Encapsulation of state is one of the cornerstones of object orientation and seen as an important tool towards reining in aliasing. The notion of private and protected access in object oriented languages merely protect a single object, but not networks of objects. Encapsulation approaches differ on how they draw boundaries around networks of objects — effectively marking a region — and the restrictions on aliasing that they place across the boundary. Hogg92 proposed a scheme called Islands. An island is a cluster of objects, and connected to other islands via bridge objects (similar to Kilim mailboxes). An ordinary object is invisible from outside and may only point to a bridge or to other objects in its island. The method parameters of bridge objects must be qualified as free or read, the equivalents to our whole and safe, thereby guaranteeing that what goes in and comes out of an island is unaliased. This makes it possible to have Hoare-style specifications on the bridge’s methods. The read mode has the problem of observational exposure discussed earlier. Almeida’s Balloon types7 has similar support for strong encapsulation, but relies on a static analysis instead of a type system to reduce the syntactic burden. The encapsulation requirements of Islands and Balloons have been considered too strict by researchers. 80 4.7. RELATED WORK Confinement types180, 191 treat the package as a natural boundary, allowing package-scoped classes access only to objects within the same package or in nested packages. The encapsulation achieved here is necessarily weaker, and the problem of concurrent modification is orthogonal to it. The Universe type system56 by Dietel and Müller has the notions of rep and peer. A rep object is part of the representation of its owning object and is fully contained within. Ownership transfer was addressed in a later extension.129 Two objects are peers if they have the same owner (cons cells in a linked list, for example). An object can only modify the fields of its own rep objects or that of a peer, commonly called as the owner-as-modifier approach. All other references are marked as readonly, in that a non-rep or non-peer field cannot be used to effect a modification, so although aliasing is permitted, the proposal limits the places where a given object can be modified. This scheme has the problem of observational exposure. Flexible Alias Protection133 addresses the read-reference concept differently. Objects that cross an abstraction boundary — called argument objects — can be passed in a special kind of read mode called arg-mode; a reference in this mode can only view immutable parts of an object, thus accommodating aliasing without compromising safety. The proposal combines this with other modes we have seen before: free and rep. Aliasing is either limited to related internal objects, or must be in the benign arg form to be made public. We decided not to add this facility to Kilim, because we wanted to explore the ramifications of isolation and ownership transfer. Treating all reachable objects as nested or owned objects in need of encapsulation is untenable. After all, a collection and its elements are independent. Ownership type schemes pioneered by Clarke47, 49 draw an encapsulation boundary using a dependency relationship. When an object directly modifies a field of another object or calls a method that modifies the object’s state, the former is said to depend upon the latter. By this definition, a list does not depend upon, or own its elements. In ownership types, every object has an owner, and its type is parametrized by an owner name (similar to regions). Owner names are supplied as polymorphic identifiers on class and method declarations. For example, the type Foo refers to all objects of class Foo and owned by the current object; in effect, it declares a new type Foo for each owning object, which fixes the owner of an object for its lifetime and prevents transfer of ownership altogether. The type Foo refers to an ownership parameter bar obtained from a class or method declaration such as this: class C{...}. This permits an object to refer to objects owned by other objects, and not mix one owning class with another. Alias containment is achieved by ensuring that an owned object can never be public or reachable from a return value; this forces access to an object to always go through its owner. This is known as the owner-as-dominator approach, in contrast to the owner-as-modifier property of Universes. Ownership type schemes do not handle iterators, because they are conceptually part of the collection but also publicly visible at the same time. Boyapati’s SafeJava31 adds where-clauses to further specify and constrain ownership relations. This, combined with a principled exception for inner classes addresses the iterator problem. Ownership types as described so far have no idea about threads and unsafe access. Clark and wrigstad,48 and subsequently SafeJava, combined ownership types with unique references to effect ownership transfer. The systems prevent variable aliases to unique references by providing explicit destructive reads: x = y is equivalent to x = y; y = null. To temporarily relax the uniqueness restriction, the systems provide a borrow construct that introduces a lexical scope where a unique variable may CHAPTER 4. TYPE SYSTEM 81 be aliased. The aliases are destroyed and uniqueness is regained at the end of the scope. Unique references and ownership types together yield an external uniqueness property, where it is possible to ensure that a message object (for example) is uniquely transferred from one subsystem to another, while the ownership type scheme preserves OO encapsulation. The problem with most ownership type schemes in practice is the syntactic burden, even with SafeJava-style type inference and well-chosen default annotations. Since each object must have an owner, an abstract data type must accommodate a corresponding owner parameter name for each object seen in its interface. For example, consider a Pair class below: 1 2 3 4 5 6 class Pair where oa  owner, ob  owner { Object a; // a owned by oa Object b; // a owned by ob } Streamflex164 and later ScopeJ190 provide an implicit ownership type system for the real-time Java setting. Because memory allocation is crucial and real-time garbage collection is frowned upon, the scheme relies on lexical scopes to combine ownership and regions, which in turn means that ownership is implicitly tied to the lexical scope. Capabilities and Permissions We next examine capability-based systems.52 The key principle here is to separate possession of a pointer from the capability to use the pointer. The static analysis tracks the type environment separately from the capability environment, and a pointer can be used only if there is a capability to use it. Such systems may even permit dangling references, and guarantee safety by retracting the corresponding access capabilities. Further, distinction is drawn between a linear capability, which indicates exclusive ownership and permission to deallocate a region, and a non-linear capability for reading and writing. The “Capabilities for Sharing” proposal by Boyland 34 et al. unifies uniqueness and immutability. It associates pointer variables and fields with a set of capabilities: basic rights to read and write fields, exclusive rights to deny other variables the right to access the same object, and finally, an ownership right, which effectively makes a variable the exclusive owner of all rights to an object, and also the meta-capability to deny rights to other aliases. AliasJava6 is a dialect of Java that combines alias annotations for uniqueness (unique, shared), and ownership (lent, owned). These annotations are applicable at all type declaration points, but the implementation infers annotations using an intraprocedural constraint analysis. AliasJava provides alias control mechanisms that are oblivious to concurrent usage, and data races are possible. For example, it is possible to use a unique reference even while the reference has been lent out. Typed Assembly Language127 and later, Alias types161 target lower-level compiler intermediate languages where memory management is important. Here, single objects are tracked (not regions) and pointers refer to abstract location names. Crucially, the capability and type environments are intermixed: while a pointer variable mentions a location, a capability mentions both the location and type of its contents. For example, a pointer variable’s type is denoted by ptr ρ and the associated capability is {ρ 7→ τ}, which denotes a linear ownership of an object ρ whose contents are of type τ. The linear 82 4.7. RELATED WORK capability allows the owner to change the underlying type of the object, as would be necessary in a memory manager; with a single change, all pointers pointing to that object point to a different object. Non-linear capabilities are denoted by {ρ 7→ τ}+ . Alias types were later enhanced with regions,184 and existential types to anonymously name regions. For example, the type ∃ρ.ptr ρ denotes a pointer to some region with some name ρ. If this were a pointer to a pair, one would say ∃ρ.[{ρ 7→ 〈τ1 , τ2 〉}]ptrρ, which packages exclusive capabilities to a tuple inside an anonymous region. Semantic rules to pack and unpack such hidden names were added using a notion of packing and unpacking the hidden capabilities. This scheme allows one to describe recursive data structures and embedding of linear types into non-linear ones. Alias types were a seminal influence on languages such as Vault55 and Sing# .60 Using existential types has the problem that each anonymous region generates its own unique type. Fähndrich and Deline61 introduced adoption and focus to fix this problem. Here, an object is born in its own region (and being a singleton, is linear as well). Adoption allows a linear object to be adopted by another region; it thereby loses both its region and its linear capability. The focus operator provides a lexical scope which allows one to focus on a particular object, while barring any aliases of that region from being usable within that scope. When the scope of the focus operation ends, the region is back to being non-linear. In recent work, Haller and Odersky82 build on the above concepts to provide concurrency-safe external uniqueness and also represents an excellent mix of the state of the art and programming convenience. An @unique annotation on method parameters forces the caller to forgo the reference. This is coupled with expose, a feature similar to Vault’s focus, that provides a block inside which modifications can be done and arbitrary internal aliases can be created, but references from outside the block are prevented from referring to an exposed object’s innards. By temporarily revoking uniqueness capabilities to a unique object, it prevents that object from being viewed simultaneously by two different threads. Their system is less general (and consequently simpler) than Vault on one count: they do not provide the parametrizable region naming scheme seen in Vault; instead, they enforce the rule that any two exposed parameters supplied to a method must belong to the same region. Kilim architecture and implementation 5 The Kilim runtime consists of components typical of user-level threading infrastructures: lightweight threads, user-level schedulers, communication and synchronization mechanisms. The differences from existing frameworks stem mainly from our focus on a server-side engineer’s checklist. This chapter presents some of the important design choices and engineering techniques, in the context of the checklist. We start with a summary: 5.1 Performance & scalability Threading optimizations (§5.1.4) and results (§6.1) Kilim I/O library (§5.4), and results (§6.2) Scheduler Hopping (§5.3.1) Messaging types to pass messages by reference Simple serialiazation of messaging types Portability JVM-independent lightweight threading Decoupling Mailboxes for control isolation Messaging types for heap isolation Safety Messaging types Overload control Bounded buffers in mailboxes (§5.2) Staging support with multiple schedulers (§5.3.1) Multi-processor support Thread-pool control with schedulers, thread affinity Blocking & pausing support in mailboxes (§5.2) Multiple schedulers, scheduler hopping (§5.3) Fault-tolerance Actor-linkage, watch-dog and mailbox timers Monitoring Interceptors in all communication constructs Simplicity Automatic stack management (linear control flow) Preemption Watch-dog timers (§5.1.5) Actors and Fibers A Kilim actor is essentially a combination of a lightweight thread and a private heap, but these are not notions that the user needs to care about. 84 5.1. ACTORS AND FIBERS package kilim; class Actor { public void start(); public @pausable void yield(); public @pausable void sleep(long ms); public void informOnExit(Mailbox mb); public ExitMsg join(); public ExitMsg joinb(); public @pausable execute() throws Exception {} } Applications define a subclass of the Kilim Actor class and override the execute() method, which serves as the starting point (like a program’s main or a Java thread’s run method). The start() method is equivalent to the spawn instruction in KAM. The type system ensures that an Actor’s constructor and public methods only message or value type paramters. We will explain the join, informOnExit and joinb methods later. The rest of this section addresses three requirements: multiplexing hundreds of thousands of actors onto a handful of kernel threads, portability and preemption. The Kilim framework has been written in a way that it can easily be ported to a VM that provides very lightweight user-level threads; in such a case, most of this section (except preemption) would be unnecessary. However, there is much more to an actor than just user-space threading, as the rest of the chapter will demonstrate. Most JVMs do not permit stack inspection (user code cannot examine the calling method’s variables and operand stack, for example), leave alone permit stack switching. The only portable alternative is to transform bytecode to have each actor cooperatively unwind itself and leave the kernel thread free to rewind some other task in its place. The transformation is a simple variant of single-shot delimited continuations.37 The Kilim actor implementation uses such a facility, provided by the Fiber class. Here is a brief detour on the notion of continuations before we examine the Kilim threading infrastructure. 5.1.1 Continuations: An overview A continuation is a label (program counter) coupled with some data; evaluating a continuation is the moral equivalent of a goto-with-data. Given this, a traditional procedure call is a special case of evaluating two continuations, the transfer of control to the procedure, followed by a jump back to the caller. In languages that provide first-class continuations, the compilers often transform all code into Continuation-Passing Style10 (CPS). Here, each procedure signature is transformed to include an extra continuation parameter, which contains the return label to jump back to (along with the caller’s activation frame); the last thing performed by any procedure is to evaluate the return continuation. If the procedure were to call another procedure (instead of returning), it would generate its own return continuation to pass to the new procedure, and link the continuation given to it to the new one it generated. This way, the traditional call stack with its stack of activation frames is merely a linked list of continuations going back all the way to the first procedure. By providing a general goto-with-data paradigm, continuations unify112, 170 all forms of control flow patterns including call/return, exceptions, signals, callbacks, loops, recursion, backtracking, coroutines, task switching, ICON-style generators76 etc. They correspond directly to the denotational semantics of programming languages.170 CHAPTER 5. KILIM ARCHITECTURE AND IMPLEMENTATION 85 Four decades have passed since continuations were first discussed in earnest;150 they are only just beginning to be adopted by the mainstream. Given native hardware support for procedure call stacks (register windows for parameters, stack registers), the predominance of synchronous procedure calls (including crucially the system call interface), and the stack-oriented mentality (stack traces for debugging, stack inspection used by the Java security framework), there is even less incentive for most mainstream systems implementations to provide first-class support for continuations. Further, experience with Scheme’s call-with-current-continuation (call/cc) has shown them to be too inefficient and error-prone in practice, especially so in the presence of mutable global state and hardware parallelism; evaluating a continuation may involve a scheduling decision as well. Our original interest in continuations was to use them for asynchronous handoff of messages between protocol layers, without building up a stack or holding the caller’s stack hostage. This is a problem central to all reactive, event-driven frameworks such as GUIs, timer services, signal handlers, network I/O handlers, and interrupt service routines. If the callback routine (the handler) in any of these cases fails to terminate (or just takes its own sweet time), it runs the danger of stalling the entire service of which it is a part. Our second interest was in providing ICON-style generators, an elegant generalization of iterators. The Kilim framework uses a restricted form of continuations that always transfers control to its caller but maintain an independent stack. It satisfies the goals above with the following architectural choices: Suspend-Resume. We preserve the standard call stack, but provide a way to pause (suspend) the current stack and to store it in a continuation object called Fiber). The Fiber is resumed at some future time. Calling Fiber.pause() pops (and squirrels away) activation frames∗ until it reaches the method that initiated resume(). This pair of calls is akin to shift/reset from the literature on delimited continuations; they delimit the section of the stack to be squirrelled away. Although Kilim fibers are not part of the public API, they have been used by other projects.1, 136 The entity responsible for resuming the Fiber differentiates the next two points. Schedulable Continuations. Kilim actors are essentially thread-safe wrappers around Fibers. A scheduler chooses which Actor to resume on which kernel thread, thereby multiplexing hundreds of thousands of actors onto a handful of kernel threads. Kernel threads are treated as virtual processors (similar to the scheduler activations9 approach) while actors are viewed as agents that can migrate between kernel threads. Generators. The Kilim framework also provides generators† , essentially suspendable iterators. When resumed, they yield the next element and promptly pause again. Generators are intended to be used by a single actor at a time, and run on the thread-stack of that actor, which means that although the actor is technically running, it is prevented from executing any of its code until the generator yields the next element. 5.1.2 Automatic Stack Management The Kilim toolkit provides a tool called a weaver that transforms the bytecode of each method along the call chain, as shown in Figure 5.1. This is a well-known technique,66, 96, 152, 175 and the differences are ∗ † Also known as stack-ripping2 or trampolining. equivalent to Python’s yield. 86 5.1. ACTORS AND FIBERS Original 1 2 3 4 5 6 g(int n) { for (int i = 0; i < n; i++) { h(); } ... } 7 8 9 10 Woven 1 2 3 4 5 6 7 h() { ... } 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 g(int n, Fiber f) { // prelude if (f.isResuming) switch(f.pc) { case H1: i = 0; n = 0; goto H1; ... } for (int i = 0; i < 10; i++) { H1: // pre-call f.down(); h(f); f.up(); // post-call if (f.isPausing) if (!f.hasState) f.state = new State_I2(i,n); f.pc = H1; } return; } else if (f.hasState) State_I2 st = (State_I2) f.state; i = st.i1; n = st.i2; } } ... } 25 26 27 28 h(Fiber f) { ... } Figure 5.1: Kilim code weaving: Providing support for suspending and resuming the stack. Note that the transformation is on bytecode, hence the presence of goto statements. primarily in some of the optimizations described later. We present below the gist of our transformation (unless otherwise noted, all line numbers refer to the right hand side of Fig. 5.1). First, the method’s signature gets an extra argument called a Fiber. This object is used as a peg to hang saved information for a subsequent restoration. The method’s entry is patched with a prelude section, essentially a set of jumps to each call site. Finally, each call site (line 3 on the left) is sandwiched with pre- and post-call sections. Let us look at stack suspension first (the post-call section in Fig. 5.1). Once h returns, g examines the fiber object to see whether the call returned normally or if it signalled that it wants to pause. If the latter, line 14 squirrels away to the fiber the information that it will need upon resumption (i, n and the code offset to return to H1). We use a canonical class tailored to the call site to store the information: in this example, State_I2 has room for exactly two integers. Once g returns, its caller does the same style of post-call processing, and so on all the way up the call chain. Stack resumption is split between the prelude and post-call sections. The prelude restores the program counter and jumps to the call site that earlier triggered the suspend. In the post-call section, if the fiber signals that h() returned normally, the state is restored (line 19) before moving on to line 23. CHAPTER 5. KILIM ARCHITECTURE AND IMPLEMENTATION 5.1.3 87 JVM constraints The JVM’s verifier and the threading-related constructs place a number of other constraints on portable stack switching. This section catalogues some of the major issues. Verifier Unlike C or machine language, one cannot simply insert a jump instruction; the JVM verifier performs liveness and data type analyses to ensure that for each instruction in the method, the live local variables and operand stack have the same data type and that the operand stack has the same depth, regardless of the path taken to get to that point. The Kilim weaver performs the same analysis as the Java verifier and ensures that the frame’s local variables are initialized with default values (corresponding to the variable’s type) along newly introduced edges in the control flow graph (Line 4 in Fig. 5.1). Allocation and Initialization The new expression (say, new Foo(m())) in Java is split into allocation, argument evaluation and the constructor call at the bytecode level: t1 = new Foo // allocate instance t2 = m() t1.(t2) // constructor call The verifier prevents jumps in and out of sections between the new and the corresponding constructor call(). Observe that the transformation of Fig. 5.1 violates this constraint: the prelude jumps directly to a call site without regard to the target label’s context. Our simple solution is to pull the allocation and initialization parts into a separate helper method (there is no run-time overhead since it gets inlined at run time): t2 = m() t1 = new_init_Foo(t2) private static Foo new_init_Foo(y) { x = new Foo(); x.init(y); return x; } This solution performs local surgery on the original bytecode, yet easily handles complicated expressions with conditional constructs such as this: new Foo(check? m(new Bar(n())): new Baz().n()) It is worth noting here that we explicitly disallow calling pausable methods within constructors (that is, a constructor cannot be marked pausable), because an object should not be receiving messages until it is fully initialized. Java subroutines The Java VM has the notion of a local subroutine, originally intended for code in finally blocks to be callable from both the try and catch sections. There are two problems with this construct: (i) it prevents normal intra-procedural analysis techniques because, by definition, it is a separate subroutine and (ii) a subroutine’s bounds are nebulous — the Java specification informally defines a subroutine as the set of instructions reachable from the jsr instruction (jump to subroutine); this could legally encompass the whole method. Most compilers no longer emit this instruction and 88 5.1. ACTORS AND FIBERS prefer to simply duplicate the finally block; the Kilim weaver follows suit, by inlining the transitive closure of instructions at the subroutine’s call site. Thread Affinity The monitorenter and monitorexit instructions are thread-affine and would require a fiber to be restored in the same kernel thread as the one in which it was reified. We take the simpler alternative of prohibiting pausable calls within synchronized blocks, with the intent to discourage sharing. However, we do provide a run-time way of pinning an actor to a particular thread, in order to coexist with libraries that rely on the kernel thread’s identity. This is distinct from higherlevel logical locks built atop messaging primitives to synchronize and sequence actor operation. 5.1.4 Optimizations We have thus far discussed reasonably well-known techniques for switching stacks portably in Java. This section describes a number of performance-oriented optimizations that set Kilim apart from competing approaches. Pausable methods. To avoid transforming all code, we introduce a Java method annotation called @pausable that helps the programmer mark those methods that wish to suspend and resume. This annotation is an interface contract similar to a checked exception; a method is pausable if it calls or overrides or implements a pausable method. The side benefit of this explicit annotation is that it alerts the programmer to the potential cost of stack switching (and hence puts a noticeable back-pressure on the design). As Adya et al.2 noted, such an explicit marker also helps with software evolution, because one cannot suddenly turn a non-pausing method into a pausable one without the caller being affected. Return, not throw. Some researchers have used exceptions as a longjmp mechanism to unwind the stack; we use return because we found exceptions to be more expensive by almost an order of magnitude, especially because they have to be caught and re-thrown at each level of the stack chain to reify the call chain. Further, an exception clears the operand stack, which forces one to take a snapshot of the operand stack before making a call; in our experiments, lazy storage and restoration performs better. Lazy storage and restoration. Figure 5.1 (line 14) shows that the frame is captured only when the callee signals a suspension (and then only if it was not captured in an earlier iteration). It is restored to its original values only when control needs to be passed to the original code following the invocation (line 23). Note that the control plane (the program counter) is restored eagerly in the prelude (with dummy values for the data to keep the verifier happy), but the data plane is restored only when absolutely required. Our experience with Kilim-based projects (Chap. 6) has been that intermediate frames in a call chain tend to get used far less than the topmost frame, and it is frequently unnecessary to eagerly initialize the local variables with the stored values. In general, mainstream programming languages and compilers do not offer tail-call optimization (TCO), nor are programs written in a way to make use of TCO even when available. As a consequence, call chains tend to be large (fifty deep is quite common in Java-based servers). Storing and restoring such stacks would undoubtedly drag performance down, but as it turns out, the actor architecture never encounters this problem. This is because subsystems are chopped up into actors with their own private CHAPTER 5. KILIM ARCHITECTURE AND IMPLEMENTATION 89 stacks. For example, a typical call chain in a Java server contains layers of methods related to socket handling, parsing, message dispatching, and application code. The equivalent actor architecture splits these layers up into several actors that suspend and resume independently and at different frequencies. Parameters, not thread-locals. Some approaches96 do not modify the signature of each pausable method, preferring instead to pass the continuation object as a thread-local variable that associates information with the kernel thread. While the approach is syntactically cleaner, we found that the thread-local facility in the pthreads library and in the present implementations of the JVM is about 10x slower than passing the fiber object as a parameter (because they involve hash-table lookups). This has considerable impact when each method in the call chain is forced to use it. Of course, a faster thread-local facility (such as using the x86 segment registers) renders moot the performance advantage. However, by changing the signature, we use the JVM’s verifier in ensuring that unwoven code is never loaded by accident. This too is a non-issue if a classloader incorporates the weaver and prevents the presence of unwoven code altogether. Constant and Duplicate propagation. The Kilim weaver’s static analysis not only tracks which registers and stack elements are live at any point in the code, it also tracks which of them have duplicated or constant values. These variables (if live) can be initialized and restored using embedded code, instead of relying on the State object. 5.1.5 Cooperative Preemption The Kilim actor class provides two methods called sigKill() and sigYield(), which send internal notifications to the actor via a boolean flag (marked volatile) in the fiber. This flag is checked on entry to every pausable method. An actor can provide a custom implementation of this method, but as with Unix signal handling or any multi-threaded code, must ensure that the data is accessed in a thread-safe manner and that the overridden method does not itself block indefinitely. One could also take the Erlang approach of letting an actor die, and let a hierarchy of supervisor objects bring the system back to a consistent state. This facility provides a measure of cooperative preemption; as long as the actor calls a pausable method, this flag is guaranteed to be checked. This of course does not prevent a runaway actor. It is trivial for the weaver to add a check at every back-edge of a loop, but in our limited experience, runaway actors are rare enough that it may not be worth burdening every loop of every method. 5.1.6 Lightweight Threads: Related Work There are any number of solutions for lightweight cooperative task management, even within the Java world. To our knowledge, none of these solutions provide any support for preemption. Adya et al.2 observe that the notions of cooperative tasking and automatic stack management are often conflated. There are many Java solutions that belong solely to the former, such as Java’s built in Executor service, Doug Lee’s FJTask115 ) and Cilk.25 Run-to-completion services are useful especially in fork-join parallelism (such as those found in MPI applications), but are not general enough for longlived ad-hoc workflow tasks. The Capriccio project182 modified the user-level POSIX threads (pthreads) library to avoid overly conservative pre-allocation of heap and stack space, relying instead on a static analysis of code to size 90 5.1. ACTORS AND FIBERS the stack, and embedded code to dynamically grow the stack where necessary. They report scalability to 100,000 preemptively-scheduled threads. Meanwhile Linux sports a new threading infrastructure (NPTL: Native Posix Thread Library) with tremendously improved performance;95 we have not determined how Capriccio fares in comparison. Pettyjohn et al. 144 generalise previous approaches to implementing first-class continuations for environments that do not support stack inspections. However, their generated code is considerably less efficient than ours; it relies on exceptions for stack unwinding, and splits the code into top-level procedures, which results in loops being split into virtual function calls. The transformation also changes access modifiers: private methods become package-public for the continuation to resume a method. Kilim’s local surgery combined with lazy suspension and resumption of the data plane gives excellent performance and its preservation of the original call stack chain preserves the developer’s mental model of the control flow. There are many frameworks that transform Java bytecode internally into a style similar to ours; we examined three for which the implementation details were available: Brakes,178 RIFE,152 and JavaFlow.96 The JavaFlow project uses thread-local variables to pass the continuation instead of modifying method signatures, as we do. While clean, this approach is error-prone because in the absence of any information from the interface (or some global knowledge of all the classes in the program) it is impossible to know whether the called interface or virtual method is pausable or not. A non-transformed piece of code would not know to look for yielding returns. In our case, the verifier would refuse to load the class because of the signature mismatch. JavaFlow correctly transforms a method if it can reach a suspend() invocation. But it unnecessarily transforms all non-pausable methods reachable from there as well, leading to substantial code bloat. None of these projects do liveness or value analysis as of this writing. This means they must store all local variables regardless of whether they will be used after resumption. We analyzed some popular Java projects (Apache Tomcat, JBoss, Antlr and the JDK) and discovered only about 30-40% of the information on average is used across method invocations. Finally, none of these projects handle subroutines and constructor calls with pausable methods. We refer the reader to a more detailed comparison presented in our paper;165 importantly, the Kilim runtime performance was significantly better than any of the other portable frameworks we tested. There are also a number of projects that have attempted to reify the state of the stack for thread persistence and migration reasons; Wei Tao’s PhD thesis174 is a good account of the problem. We have chosen not to provide for thread migration because a data center typically undergoes rolling upgrades, and it is possible for one machine’s software version to differ from another. None of the schemes described in this chapter, including Tao’s work, account for this reality. Another problem is the inability to transfer unserializable state that is tied to the machine, such as socket connections, database connections and file handles. Xavier Leroy neatly sums up all the challenges of bytecode verification and provides formalizations and algorithms for doing type analysis.118 The Kilim weaver does value analysis in addition to types, and tracks duplicate values and constants. We settled on the ASM toolkit108 (in preference to SOOT and BCEL) for its speed and compactness, but used our own verification and analysis engine. CHAPTER 5. KILIM ARCHITECTURE AND IMPLEMENTATION 5.2 91 Communication and Coordination Kilim consolidates all communication, notification and coordination activities via the Mailbox class, a type-safe multiple producer, single-consumer concurrent queue used as follows: Mailbox mbx = new Mailbox(); mbx.put(msg); mbx.put(10); // Type-error MyMsg m = mbx.get(); Type-safe concurrent access is however only one of the aspects of communication and coordination. We now highlight a few design features that make Mailbox a one-stop shop for a variety of needs. Blocking and Pausing. We use the term blocking to refer to any method that can block the current kernel thread and pausing to refer to the Kilim suspension mechanism. In the current environment, all existing thread-safe libraries and system calls are oriented towards blocking the thread and it would be impractical to provide a pausable equivalent for all blocking calls. Our measurements indicated that although threads are not lightweight, they are not so heavy that one has to be utterly parsimonious in their use. For this reason, we take a somewhat relaxed attitude towards kernel thread creation. Still, there is a limit to the number of kernel threads one can create, and to help Kilim programs straddle the blocking and pausing worlds, we provide variants of get and put in the mailbox, as shown below. class Mailbox { @pausable void void boolean @whole T ... } put(@whole T msg); // pause actor until put succeeds putb(@whole T msg); // block thread until put succeeds putnb(@whole T msg); // non-blocking put (return true on success) get(); The mailbox thus becomes an adaptor that permits the producer and consumer to have independent semantics; one can block while the other pauses. This way an actor can supply or retrieve data from a chain of blocking Java I/O streams that are completely oblivious to Kilim. Type support. The listing above shows that the API is not only type-safe with respect to the structural Java type, it also carries the whole annotation for linear ownership of mutable data (recall from Chapter 3 that this annotation forces the type to be a message type, but the linearity restriction does not apply to classes marked sharable). Selection. Kilim provides simple CSP-style ALT’ing91 in the form of a select() call that resumes the calling actor when any of the n mailboxes supplied to it is get-ready. We intend to provide the full generality of Concurrent ML149 and JCSP185 (guards, prioritized selection, waiting for both get- and put-ready events and so on). Overload Control. Mailboxes have optional bounds for overload control — put() pauses when this bound is reached and resumes only when the mailbox is drained. This is useful for structuring SEDA-like stages.186 The current design requires the programmer to specify the bound, and further research is needed to help determine such a number adaptively; it is never a good idea to embed magic numbers. 92 5.2. COMMUNICATION AND COORDINATION Support for Push/Pull. In addition to blocking or pausing, the mailbox supports the traditional subject-observer pattern to accommodate sources or sinks that are neither threads nor actors. class Mailbox { void addSpaceAvailableListener(EventSubscriber); void addMsgAvailableListener(EventSubscriber); } This facility can be used to set up a chain of mailboxes, for example, each of which is an “inline” data transformer (much like a traditional protocol stack). The value of push/pull communications is illustrated beautifully in the Click modular router project.106 The disadvantage of course is that the entity doing the pushing and pulling gets its stack held up until the chain finishes processing. The Kilim project was initiated precisely due to the problems of such subsystem coupling. Interceptors. Mailbox supports arbitrary put and get hooks that allow the message to be transformed, discarded (eliminating duplicates) or simply examined by a debugger. This style of interception permits debugging or profiling probes to be inserted at coordination points, a luxury not available to most other forms of coordination (shared memory or logic variables). Coordination. Unlike other toolkits (such as JCSP), mailboxes are not central to the infrastructure; we expect developers to provide alternative synchronization schemes as well. For example, Kilim provides a Cell, a special Mailbox type with just one slot, equivalent to Haskell’s MVar. Single assignment Future (or promises) for declarative concurrency can be added trivially as a subclass of Cell. Finally, classic concurrent signalling mechanisms such as semaphores are possible as well. In all these cases, the Kilim paradigm expects the library author to provide support for the other aspects (such as support for push/pull, monitoring, naming services, linear typing etc.) as well. Exit Monitoring. Kilim allows actors to monitor the fate of other actors via a special exit message, as shown below (similar to Erlang process-linking). Any number of mailboxes can be registered with an actor or exit messages. Mailbox exitmb = new Mailbox(); actor1.informOnExit(exitmb); actor2.informOnExit(exitmb); ExitMsg m = exitmb.get(); // wait for an ExitMsg The common case of fork-join parallelism — the parent forking off a number of children and waiting for them to finish before proceeding — is handled with a join() convenience method (there are pausing and blocking variants of this method as well): a = new MyActor().start(); a.join(); CHAPTER 5. KILIM ARCHITECTURE AND IMPLEMENTATION 5.3 93 Scheduling The attractive aspect of using a user-level scheduler is to offer the application the ability to dictate the order, timing of actor execution, and the execution context (including affinity to threads and processors). The Kilim scheduler interface binds together a collection of runnable actors, a pool of kernel-threads and a scheduling policy. We treat kernel threads as virtual processors, and the supplied schedulers only create as many threads as there are processors. However, a watchdog timer monitors when no thread has become idle in a while (either because they are busy or blocked in system calls) and spawns an extra kernel thread (up to a configured limit). class MyActor extends Kilim.Actor{ ... } new MyActor().start(); // schedule on default scheduler Scheduler s = new Scheduler(10); // create a separate thread pool with 10 threads new MyActor().start(s); // explicitly assign a scheduler. Two types of scheduler implementations are supplied, a simple round-robin implementation and another that manages a network select-loop (§5.4). We are particularly interested in custom scheduler implementations with application-specific knowledge. For example, we have experimented with a scheduler that is savvy about database transactions and gives priority to those suspended actors that have been around the longest (earliest-deadline-first scheduling). Section 6.4 describes this setup and its performance profile. Another example is a network simulator, where a custom scheduler installs mailbox interceptors and dictates the logical ordering of message delivery; this way a put or get can be artificially delayed or “lost”. Given a level of indirection between logical time and real clock time, it should be possible to write applications that run largely unchanged in simulation and real settings. 5.3.1 Scheduler Hopping An application can contain multiple schedulers at run-time; they are merely independent thread pools. Actors are allowed to switch between schedulers, similar to agent migration: class MyActor extends Actor{ @pausable void foo() { ... resumeOnScheduler(otherScheduler); // yield here, // resumed in one of otherScheduler’s threads } } We use the concept of scheduler hopping in two ways. The first is to explore its use in data partitioning. The conventional approach in a shared-nothing scheme is to send messages to a database actor. The alternative is to partition the data and have each thread own its own logical slice of the data (based on key-hashing, for example). To access a piece of data, the actor uses a partition-savvy scheduler to hop to the thread that owns the data. With one call, the actor transforms itself into a database-side stored procedure with direct and exclusive access to the data. 94 5.3. SCHEDULING Another use for scheduler-hopping is to easily emulate SEDA-style staging.186 A network task can be written as a single read-process-write chain, but different parts could be handled by different schedulers. This is an easy mechanism for controlling overload (SEDA’s objective) which achieves good cache locality if threads are pinned to processors. It is possible to separate out staging as a separate, dynamically configured aspect of the application. Consider: class MyActor extends Actor{ @pausable void execute() { stage("packet_assembly"); ... // frame packet stage("parse"); ... partition = getPartition(msg.key); stage(partition); ... call database stage("output"); } } The idea is for the application programmer to identify parts of the application that can be logically scheduled separately although the code is written in a linear style. This scheme permits late binding of stage names to scheduler instances, and even dynamic repartitioning based on processor configurations, load balancing and scheduler specialization; for example, the packet_assembly and output stages may be assigned to the network scheduler (§5.4). Even if all stages were mapped to a single thread (on a uniprocessor), the scheduler could still arrange to do wavefront scheduling, sequentially moving from stage to stage resuming only those actors that are runnable in the current stage. All new actors that get ready for a particular stage are processed in the next generation (even if they belong to the current stage.) 5.3.2 Scheduling: Related Work We have found the following pieces of work instructive for what can be achieved with control over scheduling: the cohort scheduling work by Larus and Parkes,113 and the work-stealing algorithms and hierarchical task structures of the Cilk project.25 Minor Gordon’s PhD thesis75 on “Stage scheduling for CPU-intensive servers” provides a good overview of scheduling algorithms. Although every threading framework must have a scheduler, there is surprisingly little literature on application-controllable scheduling for servers, leave alone having a combination of scheduling policies for different components or aspects. Note also that it is not just the ability to replace the scheduler instance (which many provide), but the rest of the framework must provide information that the scheduler can use (such as the use of mailbox interceptors and stage names in Kilim). Dynamic mapping of stage names to scheduler instances can help researchers study the effect of switching the mix of scheduling policies at run time. We are particularly excited about hierarchical scheduling as exemplified by the BubbleSched framework.176 This framework allows schedulers to be themselves schedulable; each scheduler maps its constituent schedulable entities to a finer range of virtual processors. At the leaf is a scheduler that CHAPTER 5. KILIM ARCHITECTURE AND IMPLEMENTATION 95 maps multiple tasks to a single core. This pattern has also been described earlier as CPU inheritance scheduling.63 Constraining data access via temporal scheduling is illustrated by the FairThreads projects ,30 which has a mixture of cooperative and preemptive threads, the former managed by a fair scheduler. The implementation relies on Java threads, however. Another approach to coupling scheduling to data access is illustrated by CoBoxes,157 which partitions data between multiple CoBox instances. A task runs under the aegis of a single CoBox with complete uninterrupted access to the CoBox’s data. Nested CoBoxes are supported for finer grained parallelism. Finally, we must mention the elegant simulation framework JIST.21 Their bytecode transformation is similar to that of Kilim. Its style of advancing the logical time is somewhat unintuitive, and we believe that it is better to shift the logical timing aspects into Kilim-style staging, message passing and scheduling support. 5.4 Network Support No server framework would be complete without support for network I/O. We briefly examine some of the pros and cons of dealing with blocking and non-blocking I/O facilities, then describe the Kilim architecture: fast non-blocking I/O with automatic stack management. A socket is non-blocking if it merely returns the number of bytes read or written, instead of blocking the current thread until some I/O could take place. A thread-per-connecton architecture is simple to write because each client session can block without affecting anyone else: while (true) { // loop runs in a separate thread dedicated to a single socket readRequest(); ... process event writeResponse() } While the linear style of coding is undoubtedly attractive, this approach has not curried favour with server-side engineers because of the performance problems enumerated in Chapter 2. Instead, engineers have found that a combination of non-blocking sockets and a kernel supported selector call such as select, kqueue or epoll is a lighter, faster, more scalable alternative to dedicating a kernel thread per connection. The following style of event-loop based handling of non-blocking sockets prevails among all kinds of servers (regardless of language): 96 5.4. NETWORK SUPPORT while (true) { wait for read write or connect events for each ready socket dispatch in this or another thread (socket, event) } // called by dispatcher handle(socket, event) { // warning: do not block the thread } The handlers are essentially finite state machines that resolve the I/O event with the socket’s state. To avoid bringing the server to a sudden halt, one must take extreme care to ensure that loops and blocking system calls finish in reasonable time. This problem tends to get especially acute in generic frameworks, as seen in early operating systems, and unfortunately in many modern-day windowing and server toolkits, where a mistake by an application programmer freezes all operations. A kernel thread dedicated to each socket is clearly much simpler to write; each thread can independently block for however long it wishes. Improvements in kernel thread architecture have turned the tide: latency is very good, scalability is good enough and the speed of context switching and synchronization are low enough for certain categories of servers such as SMTP∗ , where conversations tend to be short and bursty. At the same time, event-loop based non-blocking I/O has run into rough weather because in order to use multiple processors, the event handlers are dispatched in threads other than the event loop. There are three issues with this architecture: First, sockets must be carefully shared between the threads. The strategy employed by the Java NIO library of protecting sockets with locks has problems of its own. If the event loop thread is blocked (on select), then an attempt to use a socket from a handler thread (even an attempt to close it) would block the handler thread until the selector wakes up and releases the socket’s lock. This forces the programmer to ensure that all such blocking activities are handled sequentially in a single thread (typically the event loop’s). Second, all selector events are level-triggered; if the socket is ready to read, the selector keeps reporting the event until the socket has been drained fully. Since the event will be handled at some future time in a separate thread, one is forced to unsubscribe that socket from the selector to avoid needless notifications. This requires a subsequent re-subscription for the next round, which involves waking up the selector thread; a very expensive proposition. The third problem is that the handler still does not have the luxury of blocking its thread indefinitely (because threads are considered resource intensive and are consequently pooled), which means the original complexity and inversion of control remains. Lately, the consensus is that while event-driven I/O scores well in terms of scalability and low memory consumption, it does not on program simplicity and the other performance measures. ∗ See Paul Tyma’s presentation on this issue http://www.mailinator.com/tymaPaulMultithreaded.pdf. CHAPTER 5. KILIM ARCHITECTURE AND IMPLEMENTATION 5.4.1 97 The Kilim I/O architecture One contribution of the Kilim I/O implementation is that it is possible to have it all: straight-forward sequential style (seen in thread-per-socket code), latency better than blocking I/O, and the scalability and parsimony of resource usage of event-driven I/O. It is made possible through judicious use of scheduler hopping and by amortizing the cost of selector wakeup. We first assign each socket to a Session actor, which in turn divides its time between two schedulers: (i) a selector scheduler that manages a select loop thread and (ii) a session scheduler that coordinates a separate pool of session executor kernel threads. When an actor’s socket action (read, write, connect, accept) fails, it hops over to the selector scheduler to register the socket for the appropriate ready event, then hops back to the selector scheduler when the event is ready. By itself, this registration and context-switching strategy would be terribly expensive, so we adopt the following alternative. When a socket action fails, the session actor yields (without changing schedulers), letting another session run for a change. When the session is eventually rescheduled, there is a chance that the socket is now ready to perform the transfer; if so, we have avoided the cost of scheduler hopping and event subscription with the selector. If instead the I/O action fails again, the cycle repeats, but not indefinitely. After a few such failed polling attempts, the session concludes that the socket is inactive after all and reverts to the original plan of moving over to the selector scheduler and registering itself with the selector for appropriate socket-ready events. Further work is required to adaptively determine the number of polling attempts and to ensure fairness, to prevent a busy connection from ceding control to other connections. This scheme has a number of advantages. It is highly adaptive to server load, both in terms of the traffic per connection as well as the number of connections. If the number of connections increases, the interval between successive polls of a given socket increases proportionally, which in turn increases the chances that the socket is ready to read or write (as the case may be) the next time it is polled. If the traffic on a particular connection drops, it may be moved from the active polling list to the selector scheduler. Meanwhile, the selector’s work is lighter because it only manages relatively inactive sockets. This scheme is considerably better than both blocking I/O and competing non-blocking event Java frameworks in terms of performance, scalability and memory consumption (see §6.2). A socket is not registered with the selector unless necessary, which permits a socket to be configured to be blocking, which allows the use of libraries that depend on blocking I/O. Finally, we obtain the simplicity of sequential code because the session’s logic looks identical to the multi-threaded thread-per-session version outlined earlier; the only difference is the pausable annotation. That is, the readRequest() internally looks as follows: @pausable Request readRequest() { while not request fully assembled readMore() return parseRequest(bytes) } This part is identical to the thread-per-request architecture. The difference is in readMore(): 98 5.4. NETWORK SUPPORT @pausable readMore() { while (not socket ready) if yielded thrice hop to selector and wait for socket ready event else Actor.yield() // at this point, socket is definitely ready with data readSocket() } One particularly useful aspect of having a selector scheduler manage the event loop is that it can run actors just as any other scheduler would, whether or not that actor has anything to do with a socket. Those actors wait on mailboxes, timer events and so on. In fact, one does not even need a separate session scheduler, which makes it identical to the uniprocessor event-loop code of yore. This decision can be taken at deployment time or at run-time and for selected actors. 5.4.2 Network Support: Related Work One measure of the complexity of a solution is the number of wrappers and tutorials needed to simplify it for the “the average programmer”. Java’s support for non-blocking I/O has spawned a cottage industry of such wrappers. In contrast, the thread-per-connection support hardly needs further elaboration, a testament to its simplicity; much of it arising due to its sequential nature. Few event-loop frameworks support a combination of user-level threads and non-blocking I/O. StateThreads167 is a fast C-based tasking facility with built-in support for I/O, but as with most other tasking frameworks, cannot directly utilize multiple threads and processors. The TaskJava project62 provides similar support in Java. The Kilim framework’s approach is unique in that I/O events and other events are given equal status; one can wait on both I/O events , messaging events and external events (UI events, for example). Scheduling is exposed to the application programmer so that any task can run on any scheduler, making possible a wide variety of mappings of tasks to schedulers, selectors, and threads. Evaluation and Performance 6 Kilim is designed for task parallel applications that divide their time between CPU, network and disks (split-phase). This chapter first demonstrates Kilim’s performance in isolated settings: purely CPU activities such as the speed of actor creation and messaging, then purely network I/O to add a fresh perspective to the ongoing blocking vs. non-blocking I/O debates. Next, we examine split-phase operations with more real-life constraints. We study the CPU-network interaction using the Kilim Web Server, then CPU-disk interaction using a variant of the Berkeley DB library modified to use Kilim. In all cases we compare a Kilim based solution to an industrial strength one. 6.1 In-core benchmarks The Erlang language is often described as a concurrency-oriented language. Since it typifies the lightweight thread and messaging paradigm in server settings, we evaluated Kilim against Erlang on three of the latter’s chief characteristics: ability to create many processes, speed of process creation and speed of message passing. The in-core tests were run on a 3GHz Pentium D machine with 1GB memory, running Fedora Core 6 Linux, Erlang v. R11B-3 (running HIPE) and Java 1.6. All tests were conducted with no special command-line parameters to tweak performance. Ten samples were taken from each system, after allowing the just-in-time compilers ( JITs) to warm up. The variance was small enough in all experiments to be effectively ignored. Process creation The first test (Fig. 6.1(a)) measures the speed of (lightweight Erlang) process creation. The test creates n processes (actors) each of which sends a message to a central accounting process before exiting. We measure the time taken from the start to the last exit message arriving at the central process. Kilim’s creation penalty is negligible (200,000 actors in 578ms, a rate of 350KHz), and scaling is linear. We were unable to determine the reason for the knee in the Erlang curve. Messaging Performance The second test (Fig. 6.1(b)) has n actors exchanging n 2 messages with one another. This tests messaging performance and the ability to make use of multiple processing elements (cores or processors). Kilim’s messaging is fast (9M+ messages at 0.54 µs per message, which includes context-switching time) and scales linearly. The important lesson from this test is that the performance of the stack winding and unwinding logic is pretty much a non-issue for most server-side requirements. 6.2. NETWORK BENCHMARK: KILIM I/O VS. THREAD-PER-REQUEST Creation time (ms) 2000 1500 103 Erlang Kilim 14 12 Total time (ms) 100 1000 500 Erlang Kilim 10 8 6 4 2 0 50K 100K 150K 0 200K 1000 2000 n clients (= n2 messages) n clients (a) Creation and Destruction 3000 (b) Messaging Figure 6.1: Erlang vs. Kilim times (lower is better). Kilim messaging is 3X faster. Exploiting parallelism The dual-core Pentium platform offered no tangible improvement (a slight decrease if anything) by running more than one thread with different kinds of schedulers (all threads managed by one scheduler vs. independent schedulers). We ran the messaging performance experiment on a Sun Fire T2000 machine with 32G total memory, eight cores on one chip and four hardware threads per core. We compared the system running with one thread vs. ten. Fig. 6.2 demonstrates the improvement afforded by real parallelism. Note also that the overall performance in this case is limited by the slower CPUs running at 1.4 GHz. 7000 6000 5000 1 Thread 10 Threads 4000 3000 2000 1000 0 500 1000 1500 2000 Figure 6.2: Kilim messaging performance and hardware parallelism. Lower is better. 6.2 Network benchmark: Kilim I/O vs. Thread-per-request This section compares the performance of the Kilim I/O architecture with thread-per-request blocking I/O. We wrote a simple echo server, that echoes a 500-byte packet sent by the client. For reasons to be detailed in §6.3.2, we wrote our own load-generator, Mob, that can simulate thousands of TCP CHAPTER 6. EVALUATION AND PERFORMANCE 101 and HTTP sessions. A client — one of 24,000 such concurrent sessions, running on four 16-core machines — makes 1000 requests in a closed loop fashion (waiting for the response before sending the next request). The thread-per-request model started giving connection problems beyond about 8000 simultaneous clients, hence Fig. 6.3 does not extend the comparison beyond that number; Kilim was stable until the measured limit. The tests were run on the 585-node Darwin high performance computing cluster at the University of Cambridge∗ . The job coordination system on this cluster ensures exclusive control over the compute nodes and permits several instances of the Mob stress-testing client to be started concurrently. The test environment is as follows. Hardware Dual socket Dell 1950 1U rack mount server 2 CPUs per node CPU x 2 Intel Woodcrest cores per CPU = 4 cores in total per node @ 3.00 GHz per core Primary Storage 8 GB per node (2 GB per core) Network Gigabit Ethernet Operating System ClusterVisionOS 2.1 Linux kernel 2.6.9-67.0.4.EL lustre.1.6.4.3smp x86 64 Server setup JVM heap: 800M, TCP listen backlog: 1000. All other settings left unmodified. Client setup JVM heap: 800M. A maximum of 4 nodes (16 processors) simulate up to a total of 24,000 clients. 6.2.1 Results Figure 6.3 shows two metrics. The bottom half shows completed calls per second with different numbers of concurrently connected clients. The top half represents a measure of fairness. It shows the number of in-flight client calls which, ideally, should be near the actual offered load (the number of configured clients) since this is an I/O intensive benchmark. Note that this number makes sense only when the average latency is comparable; otherwise a server could merely hold on to a pending request from every client and would be erroneously deemed fair. However, when all else is comparable, the number of active pending calls is a measure of whether the server (and the test setup) avoid convoy effects. The two models have roughly equal throughput when the number of connected clients is low, but the thread-per-request model’s fairness is poor, which means that some clients will see far worse turnaround time than others. Kilim I/O shows better and more uniform scalability while maintaining fairness. We are confident that there exist several design parameters that can improve the latency numbers for Kilim even in the lower ranges (such as adaptively tweaking the number of times we poll the socket before returning to the selector scheduler). This is left for future work. ∗ http://www.hpc.cam.ac.uk/darwin.html 102 6.3. CPU + NETWORK BENCHMARK: KILIM WEB SERVER % active 100 50 0 90k 80k Calls/sec 70k 60k MT/Blocking 50k Kilim/NIO 40k 30k 20k 10 10 0 0 50 2k 1k 4k # Connected Clients 6k 8k Figure 6.3: Kilim I/O vs. Thread-per-request blocking model. Higher is better. In reality of course, a server is never this simple. The next section investigates a mixture of CPU and network activity with more realistic workloads. 6.3 CPU + Network benchmark: Kilim Web Server As the terms “internet” and “web” become increasingly synonymous and the world prepares to consolidate services in the “cloud”, servers on the internet are moving beyond serving fixed-size files using short-lived TCP connections to providing dynamically generated content (webmail and search), coordination and routing (games), streaming (live audio/video), event notification (server-push) and online interaction (AJAX applications). This immediacy of interaction between clients and servers requires us to re-evaluate traditional web infrastructure components (content-caches, proxies) and to redesign servers to simultaneously reduce latency while catering to increasing concurrent workload. In fact, as we shall soon see, even tools such as httperf used for stress-testing web servers need revisiting. We modified an existing Java EE∗ compliant web server called TJWS, but the current API is too constraining in many ways: it is built on thread-blocking primitives, does not adequately support server-push models, a servlet is tailored to generate a response for a single request which spreads interactive conversations over several servlet instantiations, and so on. We then wrote the Kilim Web Server (KWS), an embeddable HTTP 1.1 compliant framework, to explore the applicability of lightweight threads and message passing in the entire application and networking stack. Considering that some of the major subsystems are taken care of in the core Kilim ∗ Java Enterprise Edition CHAPTER 6. EVALUATION AND PERFORMANCE 103 package itself (I/O support, kernel thread management, scheduling support), KWS is fairly lightweight (∼ 1300 lines)∗ , but demonstrates excellent performance and stability. 6.3.1 KWS architecture Each client socket connection is handled by a Session actor. As currently written, the Session is responsible for reading the socket, parsing the request, assembling the body (if the transfer encoding is chunked) and decompressing the body if necessary. One can choose to either hand-over the assembled request to a separate actor, or if one knows a-priori the category of requests that are going to come over a particular port, one can have the Kilim I/O framework spawn a custom session that can handle request assembly, application specific functionality and the response as one linear chain. This pattern is common to high performance websites; they dedicate separate ports (in fact, different IP addresses) to file transfer, interactive services etc. Our test application takes this route. The Kilim I/O support is especially suited to handling such linear flow of control. For example, it is easy to add SSL negotiation, admission control (for new socket connections) or overload control (to throttle client requests) either inline in the Session, or as a separate actor to handle just that aspect. There are a few performance engineering aspects of the KWS architecture that deserve an aside. First, we use the Ragel state machine library † to generate fast parsers — the generated code fits compactly in a small instruction cache, and all operations are on byte arrays (no function call overhead). Second, object creation is delayed until absolutely needed. For example, an HTTP header is a UTF-8 encoded string with several key-value pairs, most of which are never examined. Instead of decoding the header and creating Java String objects for all the components of the header, KWS merely records the offset and length pairs (occupying the top and bottom halves of a 32-bit integer). Third, KWS uses the Kilim non-blocking I/O library mentioned earlier. This is not only considerably faster than a kernel-thread-per-request and blocking I/O, it is parsimonious with memory (at 25,000 connections, the resident memory used is 768M in the application described below). Fourth, all file transfers are done using Java support for sendfile, where available. When dealing with up to 90,000 requests/second (Fig. 6.3), these optimizations add up. 6.3.2 KWS application: auto-completion Static web server performance has been well-studied and several specialized and extremely lightweight file servers have been developed.141 The purpose of the following benchmark is to investigate the serving of short, dynamic content in an interactive context, as seen in server-side auto-completion on search bars and on forms, game and chat servers, etc. Messages are asynchronous (overlaid on the HTTP request/response paradigm), very short (a few hundred bytes, in contrast to dynamic content from news portals) and require low latency for online response, etc. The sample benchmark application is a simple word completion web service that, given a prefix string, binary-searches a list of 400,000 words and returns up to ten words that complete the prefix string. Each of (up to 24,000) clients makes a connection, and repeatedly requests words with no intervening think time; it is an exercise in studying response rates and overload handling. Unfortunately, ∗ There is no support yet for SSL, cookie generation and session stickiness. † http://www.complang.org/ragel/ 104 6.3. CPU + NETWORK BENCHMARK: KILIM WEB SERVER none of the existing load generators such as ab, httperf128 and Siege160 addressed our requirements satisfactorily. For example, in spite of specifying 2000 sessions to be started, the maximum number of sockets opened with httperf would be of the order of 700. The second factor is that these tools report the typical statistical measures – average, standard deviation, and median – none of which are really useful in practice. Most high-performance commercial websites use the 99.9th percentile of latency measures and other metrics, in a bid to address the worst case experience.54 The third factor leading to writing our own stress-testing client, called Mob, was that we wanted access to the raw call-level data for subsequent analysis; for one, this makes it possible to aggregate statistics from separate nodes. The benchmark compared the performance of KWS with the popular (and beautifully engineered) Jetty Web Server .97 This choice was dictated by the fact that Jetty is written in Java and is modular, allowing one to use only the core module without the heavyweight Java EE superstructure getting in the way. The feature set of Kilim and Jetty used by this test is roughly the same. Each instance of Mob starts up to 6000 sessions (TCP connections), each making 1000 HTTP GET requests at a specified rate (300-1000 req/s). Raw data for each call (begin and end times, bytes transferred, number of errors etc.) is gathered during the middle third of each session, to ensure that all clients and servers are in a steady state. The request-rate parameter was ratcheted down from 1000 requests/sec to 300, until the server could repeatedly sustain the given load and number of sessions without giving any connection errors. 6.3.3 Results In our test, it took 25,000 connections to finally saturate the 4 cores of the server (with Kilim). It goes without saying that this kind of a load is never delivered to a single server in practice (perhaps only in a denial of service attack); for one, any business fortunate enough to get 25,000 concurrent clients would scarcely rest its fortunes on the vagaries of a single machine. However, it does mean that significant loads can be handled by considerably less powerful machines, which is important as we enter the era of counting watts per transaction. As before, we combine server throughput (requests completed per second) and a fairness metric (number of clients currently active in a call). We could not get Jetty to reliably scale beyond about 5000 concurrent sessions without getting connection timeouts (there is no overload or admission control in the core module). KWS sustains the maximum offered request rate all the way up to 24,000 clients, when the server CPUs (all 4 cores) are saturated. In addition to delivering significantly better performance, KWS shows close to 100% level of fairness in keeping its client occupied, as the top half of figure 6.4 demonstrates. Figure 6.5 shows fairness in a different way. In a particular run with 24,000 concurrent client sessions, the histogram classifies requests by round-trip time (a total of 12 million calls). Notice that more than 95% of the clients show a steady latency of 400-450ms at full load (with the clients taking no think time). It shows that the server is fair to all connected clients, and such steadiness allows one to make reliable UI decisions. Remark. In practice, data centers focus on two somewhat conflicting requirements: minimizing latency and providing redundancy. When a primary machine fails and the backup takes over, the only way to ensure that the 99.9 t h percentile of latency is within acceptable bounds is to make sure there is enough CPU and network headroom on all machines. Although utilization of a machine or the CHAPTER 6. EVALUATION AND PERFORMANCE 105 % active 100 80 60 40k Calls/sec 30k 20k jetty Kilim Web Server 10k k 24 k 20 k 18 k k 16 # Connected Clients 15 12 k k 10 8k 6k 4k 2k 1k 0 50 0 10 10 0k Figure 6.4: Auto-Completion service: Kilim Web Server vs. Jetty. Higher is better. 80 % calls 60 40 20 0 0 200 400 600 800 Latency (ms) Figure 6.5: Kilim latency distribution at 24,000 clients. Almost all clients get a response between 400500 ms. 106 6.4. CPU + DISK: BERKELEY DB network is not an interesting measure, our tests show that it is possible to handle a lot more clients with less (or less powerful) hardware. 6.4 CPU + Disk: Berkeley DB Performance has become such an important differentiating factor that an increasing amount of data has been steadily migrated out of relational databases and into much simpler key-value schemes such as Google’s BigTable . Underneath most of these lies the Berkeley DB (BDB) library, an embeddable, transactional key-value storage engine and one of the most popular storage managers in use today. 6.4.1 Berkeley DB: original architecture Most such storage engines (including the version of BDB written in C) employ a B-tree to couple in-memory data structures to disk blocks. An update to the in-memory structure is mirrored by a corresponding in-place disk write. If the update is made under the aegis of a transaction, the database first writes the data to a write-ahead log and locks the corresponding records, and upon commit, the data is copied to the corresponding disk block in the data file. The architecture of the Java version of BDB is unique among storage engines (it bears scant resemblance to even the C version). It eschews in-place update of disk blocks for a single append-only log-structured disk representation containing all data: application data, database meta-data and the transactional log. It is still a B-tree scheme, but without in-place disk updates. This design not only avoids separate disk writes of log and content data, the append-only strategy avoids random seeks, yielding excellent insert and write performance. As it turns out, the code has been so tightly written that for most typical workloads, it is extremely hard to force it to stress the I/O system much. The downside is that the shared structure necessitates an elaborate latching and locking scheme. Latches are short-term reader-writer locks meant to protect shared data structures from concurrent updates, while the term “lock” is a read-write lock that protects application data until the completion of its owning transaction. To provide high concurrency, every B-tree node has its own latch. A rootto-leaf traversal involves latch-coupling: latch a node, then latch the appropriate child node, then let go of the parent latch. An insertion into a full leaf triggers node-splits from the leaf node all the way to the top, which means the latching order is bottom up, a situation primed for deadlock. There are still other areas where several updates have to happen concurrently (updates to lock manager, committing changes to B-tree nodes etc.), all of which require extensive use of Java monitors. 6.4.2 Berkeley DB: Modifications Our original goal was to write a key-value store (using cache-adaptive log structured merge trees135 ) with an in-memory non-sharing model based on Kilim actors that vertically partition the key space amongst available processors. The difference from parallel database and distributed consistent-hashing approaches is that, while “locally distributed”, the solution would still use shared memory to balance the assignment of partitions to processors (a simple non-blocking linearizable operation). Although much progress was made on this database and several fruitful discussions were held with the authors CHAPTER 6. EVALUATION AND PERFORMANCE 107 of Berkeley DB and Facebook, the project has been temporarily set aside because a fair comparison to an existing product entails building a complete product in short order. This section reports on a more modest experiment based on the following observations. The OS does not know about transactions or read-write locks or the fact that earlier transactions should be given priority to later ones (the idea being that a later transaction, having accomplished less, should be the one to block or rollback). Further, preemptive thread scheduling leads to split locks and priority inversion if a thread is de-scheduled while in possession of a logical lock. The odds of collision and priority inversion can only increase (exponentially, in the worst case) with an increase in the number of threads and hotspots. We modified BDB (the Java version) to represent each application transaction as an actor; the original uses a thread-per-transaction model. The actors are coordinated using mailboxes instead of blocking the thread as traditional read-write locks are wont to do. Most of the existing structure of the code is left unmodified (latching, lock management, recovery, etc.) The idea behind the modification was the intuition that it should pay to dissociate hundreds of thousands of logical threads of control from actual threads, and to control their scheduling based on a user-level transaction-savvy scheduler. This way, most logical transactions can be taken through to completion instead of attempting to be fair. Also, because we create only as many kernel threads as processor cores (§5.3), thereby treating the kernel thread as a virtual processor, there are far fewer threads that are actually contending for a lock. Before reporting on the performance results, here is a brief look at what the modifications to BDB entailed. About a fourth of the 5000 methods had to be annotated as @pausable (with a concomitant 18% increase in the size of the generated code). Several pre-existing classes (such as I/O streams and locks) that are wont to block the thread were replaced with pausable variants. All daemon threads (responsible for log cleanup, log writing etc.) were converted to actors. The biggest trouble came from an extensive use of synchronized blocks, which had to be replaced with an explicit acquire/release pair of calls; in addition, because Java locks are owned by the thread, we had to ensure that the actors were rescheduled in the same thread where they had acquired the lock. The scheduler used a priority queue (ordered by transaction start time) instead of a first-come first-served order. Given the usage pattern of key-value stores in many modern applications, this is a viable approach. 6.4.3 Berkeley DB: Benchmark and Results The workload consists of n logical threads that attempt to get or update a random key (in an 80/20 ratio respectively). The BDB infrastructure’s caching schemes are finely tuned and highly adaptive and most server-class machines are big enough to store the entire working set of accessed keys in memory. This means that most of the disk activity is due to the relatively infrequent updates. In other words, lock collisions and thread scheduling play a much higher part in the performance numbers than does disk I/O. We measured the elapsed time for 1000 get/put calls made by each client on a database of size 100,000, averaged over five runs for each concurrent client count. The times were measured on an Apple MacBook Pro (dual-core 2.4GHz processor, 2G memory, 3MB L2 Cache, Mac OS X v.10.5). Figure 6.6 shows the inverse number: the number of requests handled (in thousands per second). In spite of the somewhat un-aesthetic modifications to the Berkeley DB core, the Kilim variant of BDB showed three improvements: 108 6.5. QUALITATIVE EVALUATION: KILIM IN PRACTICE 103 100 Original BDB x000 gets,puts /sec Kilim BDB 80 60 40 20 0 1 0 10 0 20 0 0 0 0 0 0 0 0 0 50 100 150 200 250 400 600 800 000 1 # Threads/Actors Figure 6.6: Original vs Kilim-based Berkeley DB. Higher is better. Throughput. Kilim-BDB shows up to a 30% increase in throughput, then levels off due to CPU saturation. Note that the original threaded code starts the leveling off much earlier. Scalability. Adding more actors is merely an exercise in adding objects to the heap and to the scheduler’s run-queue and because of the run-to-completion property, it stands to reason that the Kilim variant can easily accommodate a few tens of thousands more threads of control; we expect this number to be limited only by application session and transaction timeouts. Memory usage. The resident memory usage of the Kilim variant was routinely less than half of the thread version. What was surprising, however, is that Kilim-BDB had marginally more overall (4%) CPU usage than the original. 6.5 Qualitative Evaluation: Kilim in practice The Kilim toolkit has been downloaded tens of thousands of times and has seen warm acceptance from Java server developers, particularly from the financial industry. At this early stage of adoption, there are, as yet, no published experience papers, so what follows is a brief, anecdotal mention of four different types of usage. Financial Analytics One London-based financial analysis firm is using Kilim to run thousands of analytics based on real-time data feeds; each analytic instance subscribes to particular stock market events and trends and feeds other analysis tasks. Many of these analytic applications are linear in that they wait for several events “inline”, then proceed to the next stage of computation; the sequential nature of a Kilim actor is considerably less convoluted than a callback-driven piece of code. CHAPTER 6. EVALUATION AND PERFORMANCE 109 Server infrastructure A very fast web server, asynchttp∗ , based on the Kilim core framework, was written independently by Jan Kotek. Our adaptive I/O infrastructure was inspired by one key demonstration of the asynchttpd project, that there is a considerable difference in performance between polling all sockets in a round-robin fashion versus registering them with a selector for socket-ready events. The problem with this approach is that it sends the CPU into a looping frenzy if none of the sockets are active enough. We added support for detecting inactive sockets, by hopping over to the selector scheduler after a few failed polling attempts. Actor Frameworks The Actor Foundry† project at the University of Illinois at Urbana Champaign provides a purer actor-oriented programming model than Kilim: its objective is to treat everything as actors (the console, for example, is an actor to which you send a print message). The infrastructure uses the Kilim weaver for CPS transformation, but provides its own runtime infrastructure, including components for messaging, actors and schedulers. Unlike Kilim, they copy messages for safety. The project is currently exploring various issues related to fair scheduling. Orchestration Actor networks, workflow networks and dataflow networks have one idea in common: components have input and output ports and are wired together, often visually. The Orc project at the University of Texas at Austin provides a language to do this wiring. This project’s focus is assembling data flows from distributed computations while taking into account fault-tolerance and wide-area communication latency. The language itself provides just three simple, but powerful operators: parallel computation, sequencing, and selective pruning, which are general enough to express all orchestration requirements.126 The Orc project uses Kilim for its threading transformation, to scale up to thousands of ongoing orchestrated activities. The Kilim programming model allows the programmer to wait for an event in the middle of a for loop. ∗ http://kotek.net/asynchttpd † http://osl.cs.uiuc.edu/af/ Conclusions and Future Work 7 Server-side programmers face a daunting array of objectives and constraints, often at odds with each other: quality of service (low latency and latency jitter, reliability, overload control), effective use of resources (CMP/SMP boxes, multiple data centers, throughput, energy consumption), ease of development, security, online diagnosis, coexistence in a multi-language, multi-technology environment, and so on. These issues directly translate into a large number of choices in software architectures for the server-side developer. This dissertation builds on the observation that a single server is typically deployed as part of a distributed setup, as well as on a distributed setup. In other words, a server is a part of a fault-tolerant, load-sharing cluster of servers, and separately, that the hardware platform on which a server is deployed, an SMP machine, makes no attempt to hide the fact that it is a distributed system. We propose that a distributed architecture at the intra-process level is one that seamlessly binds these two aspects. This makes the overall architecture somewhat self-similar, and solutions at the intra-process level can draw upon parallels from more macro levels. Specifically, we propose a communicating actor architecture, where actors are heap-isolated state machines with private stacks, and communicate using explicit messaging protocols. In order for this architecture to be practical, we argue that the programming language and environment must support a clutch of architectural features: (i) Threads lightweight enough to comfortably exceed the number of concurrent users and tasks, (ii) memory isolation between actors, to promote lock-free programming and to eliminate having to think about memory consistency models, (iii) the ability to pull external events into the actor’s space (the opposite of reactive programming) (iv) customizable, user-extensible synchronization and scheduling constructs. Our contribution is in delivering this vision portably (without changes to syntax or to the underlying virtual machine) in a mainstream programming language such as Java, and without sacrificing the familiar idiomatic style of the language (sequential control flow coupled with pointers and mutable objects). This work is available as an open-source toolkit called Kilim.103 Kilim provides very lightweight and fast preemptible user-level threads, which promote linear, imperative control flow that is simple to understand and is composable, unlike a reactive inversion-ofcontrol style fostered by callback-oriented code. The speed of context-switching and the memory footprint of Kilim actors compare very favorably to mainstream concurrent languages. The Kilim type system statically guarantees that actors are memory-isolated by separating messages and internal objects, and by controlling pointer heap aliasing in messages. This is done by ensuring that each object in a message can have at most one other object — either an actor or another message object — point to it (and at most via a single field). This constraint promotes both linear ownership as well as a tree structure. The single ownership of messages and confinement of all other non-message objects fosters a programming model without locks and without fear of external modification. The 112 7.1. FUTURE WORK inductive tree structure of messages permits fast serialization and simple resource management (one can trivially recycle a buffer by separating it from its current parent, secure in the knowledge that no other object is pointing to it). Finally, the Kilim toolkit allows messaging, synchronization and scheduling constructs to be usercustomizable and extensible. The toolkit approach differs from existing concurrent languages with built-in fixed sets of primitives, and frameworks (such as Java Enterprise Edition and CORBA) that impose a fixed programming style. While the threading, type-system and run-time library aspects are largely orthogonal, we make the case that they reinforce each other’s strengths when provided together. For example, one can build a Haskell-style MVar or a swap primitive while relying on the type system to control pointer aliasing. 7.1 Future Work The Kilim framework is but a first step in demonstrating the viability of the communicating actor approach in a server environment. We will outline a few avenues of research and implementation, some that extend the Kilim framework as it exists, some that explore fresh implementations that preserve some or all of the original aims (portability, imperative control flow, pointers and mutable state). Actor-Oriented JVM It is possible to improve Kilim’s context-switching speeds by two orders of magnitude, by integrating actor support in the JVM. Unlike traditional coroutine support and greenthread implementations of earlier JVMs, the scheduling aspect must be user-customizable, and schedulers should be able to signal preemption. One challenge is in sizing the stack dynamically, should one opt for a stack-centric approach (in contrast to a CPS approach, where the call chain is in the heap). A JVM’s knowledge of actors and type-directed memory confinement simplifies garbage collection tremendously. It also permits better memory layout for both the internal state of an actor as well as for messages, and different garbage collection strategies for them. A VM’s ability to make available actor-specific profiling and accounting metrics to the user-level would foster the creation of adaptive schedulers. Distribution support The one-way “send-and-pray” semantics of asynchronous messaging presents a uniform veneer over collocated and remote targets. However, failure semantics are considerably different, and latency differences are considerable. That said, given the vastly improved speeds of networks and infrastructure reliability, it is worth exploring reliable facilities (such as leasebased garbage collection) to share resources such as mailboxes, for higher-order synchronization objects,4 for well specified interface automata,43, 44, 171 and for accounting of finite resources such as security capabilities. One useful optimization may be to pay particular attention to objects that are just shared between pairs of processes, because many workflows are linear or fork-join. Orchestration As scripting approaches such as Kamaelia,102 orchestration languages such as Orc126 and dependency injection frameworks such as guice79 have shown, there is much value to dynamic binding of components. When building a component, there should be no a priori knowledge of which other component it will connect to, or the rate of message arrivals. These parameters may be supplied at deployment time and may inform the creation of specialized, deterministic scheduling policies. CHAPTER 7. CONCLUSIONS AND FUTURE WORK 113 Syntactic improvements Languages such as Scala and F# have shown an elegant upgrade path for mainstream languages; in particular, the availability of higher-order functions, Smalltalk-style block closures, and pattern-matching is extremely attractive. There is much scope for providing other essential features we have mentioned earlier, such as statically guaranteed memory isolation and systemic knowledge of actor-scheduling and of actor boundaries in debuggers, profilers and so on. Scheduling on a hierarchy of processors Many scheduling approaches recognize a hierarchy of schedulable objects, but not always a hierarchy of processing elements. It would be useful to develop a system for an entire cluster, and have a uniform way of integrating scheduling and placement aspects at all levels down to a particular core on a particular machine, especially in conjunction with orchestration approaches. It does not necessarily have to entail actor migration (due to scheduler hopping). Interruption and Cancellation semantics A principled study of the value of different interruption approaches is required. The POSIX-style signal handling approach has the same problems of multi-threaded access to shared data, while passive notifications (sending a message to a specially marked “control” mailbox, for example) have the problem of leaving it to the programmer to explicitly poll for them. The second problem is how to react to such notifications, and the manner in which the signal handler can influence further processing; should it kill the actor or raise an exception? One promising avenue is to use the exception handling syntax to install lexically and dynamically scoped signal handlers, and further, to add resumption semantics for the flow of control to resume from where the exception was thrown. Higher-level threads of control An application or workflow has several hierarchical threads of control and nested contexts, which is reflected in the way components and systems are connected and in the way in which data moves. For example, the user-level process of buying a widget may manifest itself as several TCP connections and short-term sessions. Much information can be gleaned from tracking such higher-level workflows, from profiling to denial of service attacks. One option may be for messages to carry around tokens representing hierarchies of contexts (on behalf of which work is being performed), and if actors can be made to attach those tokens (or generate dependent tokens) to outgoing messages, then a dynamic taint analysis can yield considerable information. These mechanisms could be further integrated with run-time instrumentation approaches such as dtrace40 and profiling frameworks such as ARM (Application Response Measurement).11 Automated Reasoning JML116 tools such as ESC/Java50 attempt to find common run-timer errors using static analysis, with the help of theorem provers. The non-concurrent nature of an individual actor lends itself easily to such tools. Security and Information flow We have only paid attention to data flow with the tacit understanding that the possession of a non-forgeable object identifier is equivalent to possessing a capability. We have, in related work,59 explored the use of the type system to constrain the ownership and duplication of capabilities. A type-systemic approach needs a robust runtime component that prevents forgeries and permits revocation, especially in a distributed system. The E language and environment have much to teach us in this regard. We currently treat data as just bits, not as information with differing levels of importance, secrecy and so on. Moving to an information-savvy system requires tracking control flow to prevent computations on low-valued information affecting higher-valued information.98, 130 Integrating operating systems and servers We believe that the structure of modern operating sys- 114 7.1. FUTURE WORK tems emphasizes a rigid and complicated division between user and kernel space, and that much of it is accidental complexity attributable to the use of C as a systems programming language. As the Singularity project has shown, given strongly typed languages and static verification schemes such as proof-carrying code, it is possible to treat the kernel as a collection of peer actors sharing the same physical address space and hardware protection level as any application object; mechanisms such as device-driver interrupts and scheduler activations are merely messages from one part of the system to another, and fault-isolation is a natural byproduct of an actor mindset. We intend to investigate type-directed, finer-grained isolation in the Barrelfish22 operating system. Notation and Nomenclature e •S Add e to set S ({e} ∪ S). Also used as a pattern match operator. ¬ Defined as ∗ Wildcard, dummy, place-holder x List of x qτ List of pairs: 〈q0 τ0 〉, 〈q1 τ1 〉, . . . , 〈qn τn 〉 〈. . .〉 An n-tuple ( a 7→ b µ[a 7→ b ] Map ¬ λx. b 8 if x = a undef otherwise ( b if x = a Extension of any map µ : λx. µ(x) otherwise δ Concrete path (reachability) pg. 61 ∆ Abstract path pg. 61 δ Run-time function state (activation frame + typestate map) Def. 4.2, pg. 59 ∆ Def. 4.3, pg. 59 ˆ ∆0 , ∆ Abstract function state corresponding to δ Abstract post-state Fig. 4.7, pg. 70 β Abstraction function. Converts δ = 〈σ, H , s〉 to ∆ = 〈N , E, Σ〉 Def. 4.5, pg. 60 ρ Object reference in the concrete state τ Meta-variable ranges over types Fig. 3.1, pg. 42 σ Local register file. Maps var names to a value Fig. 3.2, pg. 44 ς Concrete object’s typestate: ρ 7→ {null, f, w, r, >} §4.3.1, pg. 56 Σ Abstract node’s typestate: n 7→ π, where π ∈ {null, f, w, r, >} Def. 4.3, pg. 59 Σ0 , Σ̂ Abstract post-state Fig. 4.7, pg. 70 Γ Phase-1 Type environment §4.2, pg. 54 A Actor Fig. 3.2, pg. 44 b Mailbox ids Fig. 3.1, pg. 42 cluster Cluster of objects reachable from object Fig. 3.3, pg. 45 scluster Shape cluster. Compatible nodes reachable from a given node Def. 4.7, pg. 62 Es Set of abstract edges Def. 4.3, pg. 59 Es , Ês Set of edges in post-state (corresponding to Es) Fig. 4.7, pg. 70 f Free (unowned, writable). See ς 0 116 f,g Field names Fig. 3.1, pg. 42 H Heap of an actor Fig. 3.2, pg. 44 lb Label for jump instructions Fig. 3.1, pg. 42 L Labeling function Def. 4.4, pg. 60 M Meta-variable for message structures names Fig. 3.1, pg. 42 nX Abstract node with label X . A label represents a set of variable names pg. 58 n; Abstract node with an empty label pg. 58 Set of abstract nodes Def. 4.3, pg. 59 Ns , N̂s Set of abstract nodes in post-state Fig. 4.7, pg. 70 P Post Office Fig. 3.2, pg. 44 q Type qualifier Fig. 3.1, pg. 42 r Readonly (not writable; may or may not be owned). See ς s Ranges over statements Fig. 3.4, pg. 46 S Call stack Fig. 3.2, pg. 44 v, x, y Variable names Fig. 3.1, pg. 42 w Writable and Owned. See ς Ns 0 Bibliography [1] ActorFoundry. 9 [p. 85] http://osl.cs.uiuc.edu/af/. [2] Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur. Cooperative task management without manual stack management. In USENIX Annual Technical Conference, pages 289–302, 2002. [pp. 17, 85, 88, and 89] [3] Gul Agha and Carl Hewitt. Actors: A conceptual foundation for concurrent object-oriented programming. In B. Shriver and P. Wegner, editors, Research Directions in Object-Oriented Programming, pages 49–74. MIT Press, 1987. [p. 16] [4] Marcos Kawazoe Aguilera, Arif Merchant, Mehul A. Shah, Alistair C. Veitch, and Christos T. Karamanolis. Sinfonia: a new paradigm for building scalable distributed systems. In Symposium on Operating Systems Principles (SOSP), pages 159–174, 2007. [p. 112] [5] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers, Principles, Techniques, and Tools. Addison-Wesley, 1986. [p. 55] [6] Jonathan Aldrich, Valentin Kostadinov, and Craig Chambers. Alias annotations for program understanding. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 311–330, 2002. [p. 81] [7] Paulo Sérgio Almeida. Balloon types: Controlling sharing of state in data types. In European Conference on Object-Oriented Programming (ECOOP), pages 32–59. Springer-Verlag, 1997. [p. 79] [8] Lars O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, DIKU, University of Copenhagen, May 1994. (DIKU report 94/19). [p. 77] [9] Anderson, Bershad, Lazowska, and Levy. Scheduler activations: Kernel support for effective user-level thread management. Technical Report 90-04-02, University of Washington, 1990. [pp. 31 and 85] [10] Andrew W. Appel. Compiling with Continuations. Cambridge University Press, 1991. [11] Application response measurement (ARM). [p. 84] http://www.opengroup.org/management/arm. [p. 113] [12] Joe Armstrong. Making reliable distributed systems in the presence of software errors. PhD thesis, The Royal Institute of Technology, Stockholm, 2003. [p. 39] [13] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007. [pp. 15 and 36] 118 BIBLIOGRAPHY [14] David F. Bacon, Ravi B. Konuru, Chet Murthy, and Mauricio J. Serrano. Thin locks: Featherweight synchronization for Java. In Programming Language Design and Implementation (PLDI), pages 258–268, 1998. [p. 38] [15] David F. Bacon, Robert E. Strom, and Ashis Tarafdar. Guava: a dialect of Java without data races. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 382–400, 2000. [p. 79] [16] Jean Bacon. Concurrent Systems. Addison Wesley, 1998. [p. 36] [17] Henry G. Baker. Unify and conquer. In Conference on LISP and Functional Programming, pages 218–226, 1990. [pp. 78 and 79] [18] Henry G. Baker. ’Use-once’ variables and linear objects - storage management, reflection and multi-threading. ACM SIGPLAN Notices, 30(1):45–52, 1995. [p. 78] [19] Fred Barnes and Prof. Peter H. Welch. Communicating Mobile Processes. In Dr. Ian R. East, Prof David Duce, Dr Mark Green, Jeremy M. R. Martin, and Peter H. Welch, editors, Communicating Process Architectures 2004, pages 201–218, 2001. [20] John Barnes. High Integrity Software: The SPARK Approach to Safety and Security. AddisonWesley, 2003. [p. 18] [21] Rimon Barr, Zygmunt J. Haas, and Robbert van Renesse. JiST: an efficient approach to simulation using virtual machines. Software—Practice and Experience, 35(6):539–576, 2005. [p. 95] [22] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schuepbach, and Akhilesh Singhania. The multikernel: a new OS architecture for scalable multicore systems. In Symposium on Operating Systems Principles (SOSP). ACM Press, October 2009. [pp. 39 and 114] [23] Andrew Baumann, Simon Peter, Adrian Schuepbach, Akhilesh Singhania, Timothy Roscoe, Paul Barham, and Rebecca Isaacs. Your computer is already a distributed system. Why isn’t your OS? In Hot Topics in Operating Systems (HotOS), May 2009. [p. 34] [24] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. User-level interprocess communication for shared memory multiprocessors. Technical Report 90-05-07, University of Washington, 1990. [p. 17] [25] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 207–216, July 1995. [pp. 89 and 94] [26] Hans-Juergen Boehm. Threads cannot be implemented as a library. In Programming Language Design and Implementation (PLDI), pages 261–268, 2005. [p. 24] [27] Hans-Juergen Boehm and Sarita V. Adve. Foundations of the C++ concurrency memory model. In Rajiv Gupta and Saman P. Amarasinghe, editors, Programming Language Design and Implementation (PLDI), pages 68–78. ACM, 2008. [p. 26] BIBLIOGRAPHY 119 [28] Jeff Bogda and Urs Hölzle. Removing unnecessary synchronization in Java. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 35–46, 1999. [p. 38] [29] Shekhar Borkar. Thousand core chips: a technology perspective. In DAC ’07: Proceedings of the 44th annual Design Automation Conference, pages 746–749. ACM, 2007. [p. 33] [30] Frederic Boussinot. Fairthreads: mixing cooperative and preemptive threads in C. Concurrency and Computation: Practice and Experience, 18(5):445–469, April 2006. [p. 95] [31] Chandrasekhar Boyapati. SafeJava: a unified type system for safe programming. PhD thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. [p. 80] [32] John Boyland. Alias burying: Unique variables without destructive reads. Software—Practice and Experience, 31(6):533–553, 2001. [pp. 76 and 79] [33] John Boyland. Why we should not add readonly to Java (yet). Journal of Object Technology, 5(5):5–29, 2006. [pp. 29 and 77] [34] John Boyland, James Noble, and William Retert. Capabilities for sharing: A generalisation of uniqueness and read-only. In European Conference on Object-Oriented Programming (ECOOP), volume 2072 of LNCS, pages 2–27, 2001. [p. 81] [35] Jean-Pierre Briot, Rachid Guerraoui, and Klaus-Peter Löhr. Concurrency and distribution in object-oriented programming. ACM Computing Surveys, 30(3):291–329, 1998. [pp. 34 and 36] [36] Fredrick P. Brooks. The Mythical Man-Month: Essays on Software Engineering, 20th Anniversary Edition. Addison Wesley, Reading, Mass., second edition, 1995. [p. 23] [37] Carl Bruggeman, Oscar Waddell, and R. Kent Dybvig. Representing control in the presence of one-shot continuations. In Programming Language Design and Implementation (PLDI), pages 99–107, 1996. [p. 84] [38] Alan Burns and Andy Wellings. Concurrent and Real-Time Programming in Ada. Cambridge University Press, 2007. [p. 15] [39] Bryan Cantrill and Jeff Bonwick. Real-world concurrency. Communications of the ACM, 51(11):34–39, 2008. [p. 37] [40] Bryan Cantrill, Michael W. Shapiro, and Adam H. Leventhal. Dynamic instrumentation of production systems. In USENIX Annual Technical Conference, General Track, pages 15–28. USENIX, 2004. [p. 113] [41] Richard Carlsson, Konstantinos F. Sagonas, and Jesper Wilhelmsson. Message analysis for concurrent languages. In Static Analysis Symposium (SAS), volume 2694 of LNCS, pages 73–90, 2003. [p. 38] [42] Calin Cascaval, Colin Blundell, Maged M. Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee. Software transactional memory: Why is it only a research toy? ACM Queue, 6(5):46–58, 2008. [p. 38] 120 BIBLIOGRAPHY [43] Satish Chandra, Brad Richards, and James R. Larus. Teapot: Language support for writing memory coherence protocols. In Programming Language Design and Implementation (PLDI), pages 237–248, 1996. [p. 112] [44] Prakash Chandrasekaran, Christopher L. Conway, Joseph M. Joy, and Sriram K. Rajamani. Programming asynchronous layers with CLARITY. In Ivica Crnkovic and Antonia Bertolino, editors, ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 65–74. ACM, 2007. [pp. 17 and 112] [45] Sigmund Cherem and Radu Rugina. A practical escape and effect analysis for building lightweight method summaries. In Compiler Construction, 2007. [p. 77] [46] Jong-Deok Choi, Manish Gupta, Mauricio J. Serrano, Vugranam C. Sreedhar, and Samuel P. Midkiff. Escape analysis for Java. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 1–19, 1999. [p. 77] [47] Dave Clarke. Object Ownership and Containment. PhD thesis, School of Computer Science and Engineering, University of New South Wales, 2001. [p. 80] [48] Dave Clarke and Tobias Wrigstad. External Uniqueness is unique enough. In European Conference on Object-Oriented Programming (ECOOP), volume 2743 of LNCS, pages 176–200, 2003. [p. 80] [49] David G. Clarke, John Potter, and James Noble. Ownership types for flexible alias protection. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 48–64, 1998. [p. 80] [50] David R. Cok and Joseph Kiniry. ESC/Java2: Uniting ESC/Java and JML. Technical report, University of Nijmegen, 2004. NIII Technical Report NIII-R0413. [p. 113] [51] Melvin Conway. Design of a separable transition-diagram compiler. In Communications of the ACM, 1963. [p. 14] [52] Karl Crary, David Walker, and Greg Morrisett. Typed memory management in a calculus of capabilities. In Principles of Programming Languages (POPL), pages 262–275. ACM, 1999. [p. 81] [53] Grzegorz Czajkowski and Laurent Daynès. Multitasking without compromise: A virtual machine evolution. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 125–138, 2001. [p. 39] [54] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value store. In Symposium on Operating Systems Principles (SOSP), 2007. [pp. 15 and 104] [55] Robert DeLine and Manuel Fähndrich. Enforcing high-level protocols in low-level software. In Programming Language Design and Implementation (PLDI), pages 59–69, 2001. [p. 82] [56] Werner Dietl and Peter Müller. Universes: Lightweight ownership for JML. Journal of Object Technology, 4(8):5–32, October 2005. [p. 80] BIBLIOGRAPHY [57] ejabberd: Erlang Jabber/XMPP daemon. 121 http://ejabberd.jabber.ru. [p. 40] [58] Robert Ennals, Richard Sharp, and Alan Mycroft. Linear types for packet processing. Technical report UCAM-CL-TR-578, University of Cambridge, Computer Laboratory„ January 2004. [p. 78] [59] David M. Eyers, Sriram Srinivasan, Ken Moody, and Jean Bacon. Compile-time enforcement of dynamic security policies. In 9th IEEE International Workshop on Policies for Distributed Systems and Networks, pages 119–126. IEEE Computer Society, 2008. [pp. 55, 69, 75, and 113] [60] Manuel Fähndrich, Mark Aiken, Chris Hawblitzel, Orion Hodson, Galen C. Hunt, James R. Larus, and Steven Levi. Language support for fast and reliable message-based communication in Singularity OS. In Proc. of EuroSys, 2006. [pp. 15, 18, and 82] [61] Manuel Fähndrich and Robert DeLine. Adoption and focus: Practical linear types for imperative programming. In Programming Language Design and Implementation (PLDI), pages 13–24, 2002. [p. 82] [62] Jeffrey Fischer, Rupak Majumdar, and Todd Millstein. Tasks: Language support for eventdriven programming. In Partial Evaluation and Program Manipulation (PEPM), 2007. [p. 98] [63] Bryan Ford and Sai Susarla. CPU inheritance scheduling. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 91–105, 1996. [p. 95] [64] Jeffrey S. Foster, Tachio Terauchi, and Alexander Aiken. Flow-sensitive type qualifiers. In Programming Language Design and Implementation (PLDI), pages 1–12, 2002. [p. 59] [65] Fraser and Harris. Concurrent programming without locks. ACM Transactions on Computer Systems, 25, 2007. [p. 37] [66] Stefan Fünfrocken. Transparent migration of Java-based mobile agents - capturing and reestablishing the state of Java programs. In Mobile Agents, pages 26–37. Springer-Verlag, 1998. [p. 85] [67] Steven E. Ganz, Daniel P. Friedman, and Mitchell Wand. Trampolined style. In International Conference on Functional Programming (ICFP), pages 18–27. ACM, 1999. [p. 17] [68] David Gay, Philip Levis, and David E. Culler. Software design patterns for TinyOS. ACM Transactions of Embedded Computer Systems, 6(4), 2007. [p. 36] [69] David Gay, Philip Levis, J. Robert von Behren, Matt Welsh, Eric A. Brewer, and David E. Culler. The nesC language: A holistic approach to networked embedded systems. In Programming Language Design and Implementation (PLDI), pages 1–11, 2003. [p. 36] [70] Simon J. Gay and Malcolm Hole. Types and subtypes for client-server interactions. In Programming Languages and Systems, European Symposium on Programming, pages 74–90, 1999. [p. 40] [71] Narain H. Gehani and William D. Roome. Concurrent C. Software—Practice and Experience, 16(9):821–844, September 1986. [p. 15] 122 BIBLIOGRAPHY [72] Anders Gidenstam and Marina Papatriantafilou. LFthreads: A lock-free thread library. In Eduardo Tovar, Philippas Tsigas, and Hacène Fouchal, editors, Principles of Distributed Systems (OPODIS), volume 4878 of Lecture Notes in Computer Science, pages 217–231. Springer, 2007. [p. 38] [73] Jean-Yves Girard. Linear logic. Theoretical Computer Science, 50:1–102, 1987. [p. 78] [74] Google Protocol Buffers. [p. 18] http://code.google.com/apis/protocolbuffers/. [75] Minor Gordon. Stage scheduling for CPU-intensive servers. PhD thesis, University of Cambridge, 2008. [p. 94] [76] Ralph E. Griswold. Programming with generators. The Computer Journal, 31(3), 1988. [p. 84] [77] Dan Grossman, Greg Morrisett, Trevor Jim, Michael Hicks, Yanling Wang, and James Cheney. Region-based memory management in Cyclone. Programming Language Design and Implementation (PLDI), 37(5):282–293, May 2002. [p. 79] [78] Rachid Guerraoui. Introduction to Reliable Distributed Programming. Springer, 2006. [79] Guice. http://code.google.com/p/google-guice/. [p. 33] [p. 112] [80] Brian Hackett and Alex Aiken. How is aliasing used in systems software? In ACM SIGSOFT international symposium on Foundations of software engineering, pages 69–80. ACM, 2006. [p. 76] [81] Philipp Haller and Martin Odersky. Event-Based Programming without Inversion of Control. In Proc. Joint Modular Languages Conference, LNCS, 2006. [p. 31] [82] Philipp Haller and Martin Odersky. Capabilities for External Uniqueness. Technical report, EPFL, 2009. [p. 82] [83] Per Brinch Hansen. The Architecture of Concurrent Programs. Prentice-Hall, 1977. [pp. 15 and 37] [84] Per Brinch Hansen. Java’s insecure parallelism. ACM SIGPLAN Notices, 34(4):38–45, 1999. [pp. 25 and 37] [85] Tim Harris and Keir Fraser. Language support for lightweight transactions. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 388–402, 2003. [p. 38] [86] Tim Harris, Simon Marlow, Simon L. Peyton Jones, and Maurice Herlihy. Composable memory transactions. Communications of the ACM, 51(8):91–100, 2008. [p. 38] [87] Carl Hewitt. Viewing control structures as patterns of passing messages. Artificial Intelligence, 8(3):323–364, 1977. [pp. 14, 16, and 35] [88] Michael Hind. Pointer analysis: haven’t we solved this problem yet? In ACM SIGPLANSIGSOFT Workshop on Program Analysis For Software Tools and Engineering (PASTE), pages 54– 61. ACM, 2001. [p. 77] [89] Hirzel, Von Dincklage, Diwan, and Hind. Fast online pointer analysis. ACM Transactions on Programming Languages and Systems (TOPLAS), 29, 2007. [p. 77] BIBLIOGRAPHY 123 [90] C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12, 1969. [p. 78] [91] C. A. R. Hoare. Communicating sequential processes. Communications of the ACM, 21(8):666– 677, 1978. [pp. 14, 17, and 91] [92] John Hogg. Islands: Aliasing protection in object-oriented languages. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 271– 285, 1991. [p. 79] [93] John Hogg, Doug Lea, Alan Wills, Dennis de Champeaux, and Richard C. Holt. The Geneva convention on the treatment of object aliasing. OOPS Messenger, 3(2):11–16, 1992. [pp. 49 and 76] [94] Gerard J. Holzmann. The SPIN Model Checker. Pearson Education, 2003. [p. 14] [95] Blunt Jackson. NPTL: The New Implementation of Threads for Linux. Dr. Dobb’s Journal, Aug 2005. [p. 90] [96] JavaFlow: Apache Commons project for Java continuations. commons/sandbox/javaflow. [97] Jetty webserver. http://jakarta.apache.org/ http://www.mortbay.org/jetty/. [98] Jif: Java + information flow. http://www.cs.cornell.edu/jif/. [pp. 85, 89, and 90] [p. 104] [p. 113] [99] Erik Johansson, Konstantinos F. Sagonas, and Jesper Wilhelmsson. Heap architectures for concurrent languages using message passing. In Proc. of The Workshop on Memory Systems Performance and The International Symposium on Memory Management (MSP/ISMM), pages 195–206, 2002. [p. 38] [100] JSR 133. Java memory model and thread specification. http://jcp.org/en/jsr/detail?id=133. [p. 26] [101] JSR-308 Annotations on Java Types. http://jcp.org/en/jsr/detail?id=308. [102] Kamaelia: Concurrency made useful, fun. [103] Kilim. http://kilim.malhar.net. http://www.kamaelia.org/Home. [p. 77] [pp. 36 and 112] [pp. 16 and 111] [104] Gerwin Klein and Tobias Nipkow. A machine-checked model for a Java-like language, virtual machine, and compiler. ACM Transactions on Programming Languages and Systems (TOPLAS), 28(4):619–695, 2006. [p. 55] [105] Naoki Kobayashi. Quasi-linear types. In Principles of Programming Languages (POPL), pages 29–42, 1999. [p. 78] [106] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263–297, 2000. [pp. 33 and 92] [107] Maxwell N. Krohn, Eddie Kohler, and M. Frans Kaashoek. Events can make sense. In USENIX Annual Technical Conference, pages 87–100, 2007. [pp. 16 and 25] 124 BIBLIOGRAPHY [108] Eugene Kuleshov. Using ASM framework to implement common bytecode transformation patterns. In Conference on Aspect-Oriented Software Development. Springer-Verlag, 2007. [p. 90] [109] Yves Lafont. Introduction to linear logic. In Lecture notes for the Summer School on Constructive Logics and Category Theory, August 1988. [p. 78] [110] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Computers, 28(9):690–691, 1979. [p. 26] [111] Butler W. Lampson and David D. Redell. Experience with processes and monitors in Mesa. Communications of the ACM, 23(2):105–117, February 1980. [pp. 15, 36, and 37] [112] Peter J. Landin. A generalization of jumps and labels. Higher-Order and Symbolic Computation, 11(2):125–143, 1998. [p. 84] [113] James R. Larus and Michael Parkes. Using cohort scheduling to enhance server performance. In USENIX, editor, USENIX Annual Technical Conference. USENIX, 2002. [p. 94] [114] Hugh C. Lauer and Roger M. Needham. On the duality of operating system structures. Operating Systems Review, 13(2):3–19, 1979. [115] Doug Lea. A Java fork/join framework. In Java Grande, pages 36–43, 2000. [p. 89] [116] Gary T. Leavens and Yoonsik Cheon. Design by contract with JML. Draft, available from jmlspecs.org., 2005. [p. 113] [117] Edward A. Lee. The problem with threads. IEEE Computer, 39(5):33–42, 2006. [pp. 14, 16, 24, 32, and 34] [118] Xavier Leroy. Java bytecode verification: algorithms and formalizations. Journal of Automated Reasoning, 2003. [pp. 54, 55, and 90] [119] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification. Addison-Wesley, 1999. [p. 41] [120] Anil Madhavapeddy. Creating High-Performance, Statically Type-Safe Network Applications. PhD thesis, University of Cambridge, 2007. [p. 17] [121] Jeremy Manson, William Pugh, and Sarita Adve. The Java memory model. Principles of Programming Languages (POPL), 40(1), January 2005. [p. 26] [122] Mark Marron, Deepak Kapur, Darko Stefanovic, and Manuel V. Hermenegildo. A static heap analysis for shape and connectivity: Unified memory analysis: The base framework. In George Almási, Calincaval, and Peng Wu, editors, LCPC, volume 4382 of Lecture Notes in Computer Science, pages 345–363. Springer, 2006. [p. 78] [123] Bard Bloom Martin Hirzel, Nathaniel Nystrom and Jan Vitek. Matchete: Paths through the pattern matching jungle. In Practical Aspects of Declarative Languages, 2008. [p. 18] [124] Robin Milner. Communicating and Mobile Systems: the π-calculus. Cambridge University Press, 1999. [pp. 14, 16, 17, and 43] BIBLIOGRAPHY 125 [125] Apache MINA. [p. 18] http://mina.apache.org/. [126] Jayadev Misra and William Cook. Computation orchestration: A basis for wide-area computing. In Journal of Software and Systems Modeling, 2007. [pp. 109 and 112] [127] Greg Morrisett, David Walker, Karl Crary, and Neal Glew. From System-F to typed assembly language. ACM Transactions on Programming Languages and Systems, 21(3):527–568, May 1999. [p. 81] [128] David Mosberger and Tai Jin. httperf—a tool for measuring web server performance. SIGMETRICS Perform. Eval. Rev., 26(3):31–37, 1998. [p. 104] [129] Peter Müller and Arsenii Rudich. Ownership transfer in universe types. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 461– 478. ACM, 2007. [p. 80] [130] Andrew C. Myers. JFlow: Practical mostly-static information flow control. In Principles of Programming Languages (POPL), pages 228–241, 1999. [p. 113] [131] Francesco Zappa Nardelli, Peter Sewell, Jaroslav Ševčík, Susmit Sarkar, Scott Owens, Luc Maranget, Mark Batty, and Jade Alglave. Relaxed memory models must be rigorous. In Workshop on Exploiting Concurrency Efficiently and Correctly (EC2 ), 2009. [p. 27] [132] Flemming Nielson, Hanne Riis Nielson, and Chris Hankin. Principles of Program Analysis. Springer, 1999. [p. 54] [133] James Noble, Jan Vitek, and John Potter. Flexible alias protection. In Eric Jul, editor, European Conference on Object-Oriented Programming (ECOOP), volume 1445 of Lecture Notes in Computer Science, pages 158–185. Springer, 1998. [p. 80] [134] Occam-pi programming language. http://www.cs.kent.ac.uk/projects/ofa/kroc. [pp. 15 and 36] [135] Patrick E. O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth J. O’Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33, 1996. [p. 106] [136] Orc: Orchestration language. http://orc.csres.utexas.edu/. [p. 85] [137] John K. Ousterhout. Why threads are a bad idea (for most purposes), 1996. Presentation given at the 1996 Usenix Annual Technical Conference, January 1996. [p. 25] [138] John K. Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. The case for RAMClouds: Scalable high-performance storage entirely in DRAM. Technical report, Stanford University, 2009. [p. 34] [139] Krzysztof Palacz, Jan Vitek, Grzegorz Czajkowski, and Laurent Daynès. Incommunicado: efficient communication for isolates. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 262–274, 2002. [p. 39] 126 BIBLIOGRAPHY [140] Matthew M. Papi. Practical pluggable types for Java. Master’s thesis, Massachusetts Institute of Technology, 2008. [pp. 19 and 77] [141] David Pariag, Tim Brecht, Ashif Harji, Peter Buhr, and Amol Shukla. Comparing the performance of web server architectures. In Proc. of EuroSys, 2007. [p. 103] [142] Matthew J. Parkinson. Local reasoning for Java. Technical Report 654, University of Cambridge Computer Laboratory, November 2005. [p. 78] [143] Thomas Martyn Parks. Bounded scheduling of process networks. PhD thesis, University of California, Berkeley, 1995. [p. 35] [144] Greg Pettyjohn, John Clements, Joe Marshall, Shriram Krishnamurthi, and Matthias Felleisen. Continuations from generalized stack inspection. In International Conference on Functional Programming (ICFP), pages 216–227, 2005. [p. 90] [145] Benjamin C. Pierce. Types and Programming Languages. MIT Press, 2002. [p. 54] [146] Dan Pritchett. BASE: An ACID alternative. ACM Queue, 6(3):48–55, 2008. [p. 15] [147] William Pugh. The Java memory model is fatally flawed. Concurrency: Practice and Experience, 12(6):445–455, May 2000. [p. 26] [148] Ganesan Ramalingam. The undecidability of aliasing. ACM Transactions on Programming Languages and Systems (TOPLAS), 16(5):1467–1471, 1994. [pp. 25, 50, 57, and 77] [149] John H. Reppy. CML: A higher-order concurrent language. In Programming Language Design and Implementation (PLDI), pages 293–305, 1991. [p. 91] [150] John C. Reynolds. The discoveries of continuations. 6(3/4):233–247, November 1993. Lisp and Symbolic Computation, [p. 85] [151] John C. Reynolds. Separation logic: A logic for shared mutable data structures. In IEEE Symposium on Logic in Computer Science (LICS), pages 55–74, 2002. [p. 78] [152] RIFE web application framework. http://rifers.org. [pp. 85 and 90] [153] Peter Van Roy and Seif Haridi. Concepts, Techniques, and Models of Computer Programming. MIT Press, 2004. [p. 36] [154] Mooly Sagiv, Thomas Reps, and Reinhard Wilhelm. Solving shape-analysis problems in languages with destructive updating. ACM Transactions on Programming Languages and Systems (TOPLAS), 20(1):1–50, January 1998. [pp. 57, 61, 62, 69, 77, and 78] [155] Alexandru Salcianu and Martin C. Rinard. A Combined Pointer and Purity Analysis for Java Programs. In MIT Technical Report MIT-CSAIL-TR-949, 2004. [p. 77] [156] Scala programming language. http://www.scala-lang.org. [p. 31] [157] Jan Schäfer and Arnd Poetzsch-Heffter. Coboxes: Unifying active objects and structured heaps. In Gilles Barthe and Frank S. de Boer, editors, IFIP International Conference on Formal Methods for Open Object-based Distributed Systems (FMOODS), volume 5051 of Lecture Notes in Computer Science, pages 201–219. Springer, 2008. [p. 95] BIBLIOGRAPHY 127 [158] Douglas Schmidt and Tim Harrison. Double-checked locking: An optimization pattern for efficiently initializing and accessing thread-safe objects. In Third Annual Pattern Languages of Program Design Conference, 1996. [p. 26] [159] Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22, 1990. [p. 14] [160] Siege: http load testing and benchmarking. http://www.joedog.org/index/siege-home. [p. 104] [161] Frederick Smith, David Walker, and J. Gregory Morrisett. Alias types. In Programming Languages and Systems, European Symposium on Programming, volume 1782 of LNCS, pages 366– 381, 2000. [p. 81] [162] Yee Jiun Song, Marcos Aguilera, Ramakrishna Kotla, and Dahlia Malkhi. RPC Chains: Efficient client-server communication in geodistributed systems. In Symposium on Networked Systems Design and Implementation (NSDI), 2009. [p. 17] [163] Lawrence Spracklen and Santosh G. Abraham. Chip multithreading: Opportunities and challenges. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005. [p. 33] [164] Jesper Honig Spring, Jean Privat, Rachid Guerraoui, and Jan Vitek. Streamflex: highthroughput stream programming in Java. In Richard P. Gabriel, David F. Bacon, Cristina Videira Lopes, and Guy L. Steele Jr, editors, ACM International Conference on ObjectOriented Programming, Systems and Applications (OOPSLA), pages 211–228. ACM, 2007. [p. 81] [165] Sriram Srinivasan. A thread of one’s own. In Workshop on New Horizons in Compilers, 2006. [p. 90] [166] Sriram Srinivasan and Alan Mycroft. Kilim: Isolation-typed actors for Java; a million actors, safe zero-copy communication). In European Conference on Object-Oriented Programming (ECOOP), 2008. [167] State Threads library for internet applications. http://state-threads.sourceforge.net/. [p. 98] [168] Bjarne Steensgaard. Points-to analysis in almost linear time. In Principles of Programming Languages (POPL), pages 32–41, 1996. [p. 77] [169] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The end of an architectural era: (it’s time for a complete rewrite). In VLDB ’07: Proceedings of the 33rd international conference on Very large data bases, pages 1150–1160, 2007. [p. 39] [170] Christopher Strachey and Christopher P. Wadsworth. Continuations: A mathematical semantics which can deal with full jumps. Monograph PRG-11, Oxford University Computing Laboratory, Programming Research Group, 1974. [p. 84] [171] Daniel C. Sturman. Modular Specification of Interaction Policies in Distributed Computing. Ph.D. thesis, Dept. of Computer Science, University of Illinois, Urbana, 1996. [p. 112] 128 BIBLIOGRAPHY [172] Herb Sutter. The free lunch is over: a fundamental turn toward toward concurrency. Dr. Dobb’s Journal, March 2005. [173] Herb Sutter and James R. Larus. Software and the concurrency revolution. ACM Queue, 3(7):54–62, 2005. [p. 24] [174] Wei Tao. A portable mechanism for thread persistence and migration. PhD thesis, University of Utah, 2001. [p. 90] [175] Hidehiko Masuhara Tatsurou Sekiguchi and Akinori Yonezawa. A simple extension of Java language for controllable transparent migration and its portable implementation. In Coordination Models and Languages, pages 211–226. Springer-Verlag, 1999. [p. 85] [176] Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. Building portable thread schedulers for hierarchical multiprocessors: The BubbleSched framework. In Euro-Par, pages 42–51, 2007. [p. 94] [177] Mads Tofte and Jean-Pierre Talpin. Region-based memory management. Information and Computation, 132(2):109–176, 1997. [p. 79] [178] Eddy Truyen, Bert Robben, Bart Vanhuate, Tim Coninx, Wouter Joosen, and Pierre Verbaeten. Portable support for transparent thread migration in Java. In Second International Symposium on Agent Systems and Applications and Fourth International Symposium on Mobile Agents, pages 29–43. Springer-Verlag, 2000. [p. 90] [179] Matthew S. Tschantz and Michael D. Ernst. Javari: adding reference immutability to Java. In ACM International Conference on Object-Oriented Programming, Systems and Applications (OOPSLA), pages 211–230, 2005. [p. 77] [180] Jan Vitek and Boris Bokowski. Confined types in Java. Software—Practice and Experience, 31(6):507–532, 2001. [p. 80] [181] J. Robert von Behren, Jeremy Condit, and Eric A. Brewer. Why events are a bad idea (for high-concurrency servers). In Hot Topics in Operating Systems (HotOS), pages 19–24, 2003. [pp. 16 and 25] [182] Robert von Behren, J. Condit, F. Zhou, George Necula, and E. Brewer. Capriccio: Scalable threads for internet services. In Symposium on Operating Systems Principles (SOSP), 2003. [p. 89] [183] Philip Wadler. Is there a use for linear logic? In Partial Evaluation and Program Manipulation (PEPM), pages 255–273, 1991. [p. 78] [184] David Walker and Greg Morrisett. Alias types for recursive data structures. In ACM SIGPLAN Workshop on Types in Compilation (TIC) , Montral, Qubec , September, 2000, volume 2071, pages 177–206. Springer-Verlag, 2001. [p. 82] [185] Peter Welch. JCSP: Java Communicating Sequential Processes. projects/ofa/jcsp. http://www.cs.kent.ac.uk/ [p. 91] BIBLIOGRAPHY 129 [186] Matt Welsh, David Culler, and Eric Brewer. SEDA: an architecture for well-conditioned, scalable Internet services. Operating Systems Review, 35(5):230–243, December 2001. [pp. 17, 91, and 94] [187] John Whaley and Monica S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In Programming Language Design and Implementation (PLDI), pages 131–144, 2004. [p. 77] [188] Phil Winterbottom, Sean Dorward, and Rob Pike. The Limbo Programming Language. In Proceedings of Compcon 97, 1997. [p. 15] [189] Byung-Sun Yang, Soo-Mook Moon, and Kemal Ebcioğlu. Lightweight monitors for the Java virtual machine. Software—Practice and Experience, 35(3):281–299, 2005. [p. 38] [190] Tian Zhao, Jason Baker, James Hunt, James Noble, and Jan Vitek. Implicit ownership types for memory management. Science of Computer Programming, 71(3):213–241, 2008. [p. 81] [191] Tian Zhao, Jens Palsberg, and Jan Vitek. Type-based confinement. Journal of Functional Programming, 16(1):83–128, 2006. [p. 80]