Tail-Recursive Stack Disciplines for an Interpreter Richard A. Kelsey NEC Research Institute kelsey@research.nj.nec.com 8 March 1993 Abstract Many languages, including Scheme, ML, and Haskell, require that their implementations support tail-recursive calls to arbitrary depth. This requirement means that a traditional stack discipline cannot be used for these languages. This paper examines several di erent methods of implementing proper tail recursion in a stack-based interpeter, including passing arguments in a heap, copying arguments to tail-recursive calls, and garbage collecting the stack. Benchmark timings and other run-time statistics are used to compare the di erent methods. The results show that using a stack is a good idea, and that the overhead of the interpreter largely overshadows the di erences in performance of the various stack disciplines. This is a slightly enhanced version of Technical Report NU-CCS-93-03, College of Computer Science, Northeastern University, 1992. 1 The Problem A programming language implementation is properly tail recursive if unbounded iterative computations that are expressed recursively can be executed in constant space. Implementations of languages such as Scheme [Rees 86], ML [Milner 88], and Haskell [Hudak 90] are required to be tail recursive. These languages also require that some lexical environments have inde nite extent. That is, arguments passed to a procedure may need to be preserved until long after the procedure has returned. Similarly, in Scheme, continuations also may have inde nite extent. A continuation is all of the information that is saved across a procedure call and used to restart the procedure's caller once the procedure returns. In Scheme procedure invocations may return more than once, so continuations may need to be saved for later use. 1 Why do I, normally a compiler hacker, want to use interpreters to implement such languages? There are several advantages of an interpreter over a native-code compiler. Interpreters are much more portable and usually provide much better debugging information for the user. The interpreter I used for the experiments reported here is a C program that can be run without modi cation on a large number of di erent types of machines. An interpreter usually does a minimal amount of processing of a source program before execution. This makes it easy to relate an interpreter's internal state to the source program in a way that is comprehensible to the user. Interpreters should be simple, fast, and provide good debugging information for the user. I will judge the stack disciplines on how well they support these goals. The best stack discipline might be not to have a stack at all. It is advocated in [Appel 87] that with a very ecient garbage collector and a good compiler, heaps may be more ecient than stacks. With an interpreter this appears not to be the case. Making ecient use of a heap for procedure calls requires more sophisticated compilation than is usual for interpreters. When not using a stack, and with an ecient garbage collector, the benchmarks used in this paper ran 13% more slowly than the slowest of the stack implementations, and 27% more slowly than the fastest. 2 Why a traditional stack won't work Stacks are the standard way to implement continuations and lexical environments [Aho 86]. To make a procedure call, the arguments are pushed on the stack followed by the saved program counter, environment pointer and whatever else will be needed once the procedure returns. In the called procedure, the pushed arguments become the new lexical environment and the saved state of the caller is the current continuation. The environment and continuation together are often refered to as an `activation record'. When the procedure returns, the environment and the continuation are removed from the stack. Figure 1 shows how such a stack might look just as control transfers to the body of procedure G and again at the beginning of F in the following Scheme example: (DEFINE FOO (LAMBDA (A D) (F A B (G C D)))) The interpreter is assumed to have three registers: a program counter *PC*; the current environment *ENV*; and the current continuation *CONT*. All three of these are saved when a new continuation is created. The environment includes a pointer to the called procedure's saved lexical environment to be used for reaching variables bound in outer lexical contours. 2 *ENV* *CONT* *ENV* *CONT* Size 5 *PC* *CONT* Size 3 Continuation to call to G (includes *PC* *ENV* *CONT* Size 3 *ENV* Access Link D Size 4 Environment for G Access Link (G C D) C B A *CONT* A Size ? *PC* Continuation to call to FOO *CONT* *ENV* *ENV* Size 3 Size 3 Access Link D Environment for F B the first two arguments to F) Size ? *PC* Continuation to call to F Access Link Environment for FOO D A Continuation to call to FOO Environment for FOO A Figure 1: Examples of traditional stack usage The standard stack discipline is not an obstacle to implementing longlived environments or continuations. In most cases, the environment and continuation passed to a procedure are removed from the stack when the procedure returns. If they need to be preserved they may be copied to a heap [Clinger 88]. Copying requires that each environment and continuation include an indication of its size. For continuations, this may be either unnecessary, if continuations need not be preserved, or it may be found at a known o set from the saved program counter and not take up any space on the stack. The diculty with the standard stack discipline is that it does not implement proper tail recursion. When control transfers to the body of F the stack contains the environment and the continuation of the call to FOO. But since the call to F is tail recursive, these do not contain any necessary information. The values in the environment will never be used again. The continuation will just remove itself and the environment from the stack and return to its saved value of *CONT*. After a series of such calls the stack may over ow, even though it contains very little useful data. The traditional stack discipline eventually recovers the stack space used for a procedure call, but in the case of tail-recursive loops, it recovers it too late. Over owing the stack in this fashion puts a limit on the number of times a recursively expressed iterative loop can execute. This is not a problem in 3 languages such as C, Pascal, or FORTRAN which include various forms of iterative control constructs. For languages that provide recursion, but not direct iteration, environments and continuations must be implemented in some other fashion. Dealing with the continuations is easy. Simply do not make a new continuation for a tail-recursive call, and reuse the current continuation instead. In the example, the call to F does not require a new continuation and could use the one passed to FOO. The real problem is the environments. A tailrecursive call simultaneously creates a new environment and makes the old one obsolete. On a stack, the storage used for the old environment is lost, as it lies beneath the new environment. 3 Methods for implementing proper tail recursion I will consider three general approaches to implementing proper tail recursion on a stack. 1. Use the heap instead of a stack One obvious solution, since there has to be a heap for some environments and continuations, is just to put the environments there in the rst place. Unused heap storage will eventually be reclaimed, which allows for an unbounded number of tail-recursive calls. [Appel 87] advocates placing all environments, as well as all continuations, in the heap. 2. Overwrite the environment during tail-recursive calls It is also possible to implement proper tail recursion and still use a stack for the environments. If no new continuation were created for the call to F and its arguments overwrote the environment of FOO, the call could be made with no unnecessary data on the stack. Any number of such calls could be made without over owing. There are two possibilities here. The current environment could be incrementally overwritten as the new arguments are produced, as in [Kranz 88, Hanson 90], or the arguments could be stored in temporary locations and then copied on top of the current environment once they are all available, also in [Hanson 90, McDermott 80]. The rst is usually more ecient, as a clever implementation can get away with moving fewer values, and the second is clearly simpler, as data dependencies between the values in the environment and the arguments can be ignored. In the worst case, where every element of the environment is needed to compute each argument, the rst method degenerates to the second. There is a diculty with incrementally overwriting the current environment that makes it unsuitable for an interpreter. If an error occurs during the evaluation of an argument, the calling procedure's environment will be partially overwritten by the arguments to the called procedure. This makes 4 *ENV* *CONT* Size 3 Access Link D Environment for G C Size 5 *ENV* *PC* *CONT* *ENV* Continaution for G (including the first two arguments to F) *CONT* B A A Access Link D Access Link Access Link Environment For FOO D Size ? Size ? *PC* *PC* *ENV* Size 4 A A *CONT* Arguments to F Size 3 Environment for FOO Continuation to call to FOO *CONT* (G C D) B Size 3 *ENV* (G C D) Environment for F B A Size ? Continuation to call to FOO *CONT* *ENV* *PC* *CONT* Continuation to call to FOO now used for F *ENV* Figure 2: The stack with environments and continuations switched: just before jumping to G; after G returns; and after overwriting the environment of FOO providing coherent debugging information to the user very dicult. The goals of simplicity and maximal debugging information become mutually exclusive. Delaying the overwriting until all of the arguments have been produced avoids this problem, so from here on I will only consider that alternative. To avoid complications, the stack usage would have to be changed so that continuations to calls were pushed before the arguments, rather than afterwards. Figure 2 shows how such a stack would appear at various times during the execution of the example. A similar calling sequence is presented in [Steele 77]. Otherwise a procedure would sometimes be called with its arguments on the top of the stack, while at other times there would be a continuation above the arguments. 3. Garbage collect the stack Another possibility is to wait until the stack is full and then reclaim any unused space. The stack can be treated as a constrained heap, containing only environments and continuations. With this method many environments can be reclaimed without cost when procedures return. The additional cost 5 of garbage collection matters only in the relatively rare instance of the stack actually lling up. Garbage collecting the stack is simple as pointers between continuations and environments follow a very regular pattern. There are at most three pointers into the stack from the outside: the stack pointer, the control link, and the current environment. In a highly optimized implementation there may only be the stack pointer, with the other two pointers being implicit. Continuations contain a control link which points into the stack, and a saved environment pointer that may or may not point into the stack. Again, in a complex implementation the control link may be implicit. The access link of environments on the stack may also point into the stack. There are no other pointers into the stack. Control links always point to continuations, and saved environment pointers and access links always point to environments. Thus the stack garbage collector never needs to check the type of an object. There are at least three strategies for a stack garbage collector. The stack could be compacted in place, it could be copied to another stack, or it could be copied into a heap. The rst possibility, compacting the stack in place, can be implemented using a pointer reversal algorithm as long as the compiler maintain certain invariants. The requirement is that pointers to environments on the stack must point to the next highest used environment; they cannot skip over any environment that is still in use. This is not dicult for the compiler to do { indeed generating correct code that violates this requires some program analysis { but it does eliminate certain potential optimizations. The second method, copying the live data into a second stack, does not restrict the compiler as the additional space allows the use of forwarding pointers. It could result in thrashing if a program pushed a large quantity of live data and then repeatedly caused the stack to over ow. A system similar to that described in [Hieb 90] for minimizing the cost of saving continuations could be used to reduce or avoid copying data multiple times. The third possibility, copying the live data into a heap, reduces the possibility of thrashing. It is also simpler as it avoids the necessity of preserving the relative positions of the environments and continuations. There is the additional overhead of copying continuations back onto the stack as they are needed (the environments might as well remain in the heap). Checking for stack under ow during procedure returns can be avoided by placing a special continuation at the base of the stack which copies the next continuation from the heap onto the stack. Locality considerations might keep the cost of copying fairly low, as argued in [Clinger 88]. This is certainly the most attractive strategy for languages that already require a heap. 6 4 Saving Environments All of the languages which were given as examples of properly tail-recursive languages are also lexically scoped and allow procedures to escape upwards. This means that a local environment may be needed long after the return of the call that created it. When it is determined that an environment may outlive the call that created it, the environment can be preserved by a sort of sub-garbage-collection, which migrates the environment (and its superior environments if they are also on the stack) from the stack to the heap. The decision to save an environment on the heap may be made at compile time or by a check at run time. The same restrictions on pointers into the stack that make the full garbage collection simple are useful here as well. The cost of migrating an environment to the heap is linear in the size of the environment and the number of continuations that reference it; the size of the rest of the stack does not matter. Compile-time analysis and run-time checks can both be used to reduce the number of environments that are migrated to the heap, as reported in [McDermott 80, Kranz 88, Hanson 90] and elsewhere. The ability to garbage collect the stack increases the number of environments that can be allocated on the stack, even for a compiler that normally overwrites the current lexical environment when making tail-recursive calls. Environments that would not be allowed on the stack according to a strict interpretation of proper tail recursion can be stack-allocated and migrated in the unlikely event of a stack over ow. Using a compiler that allows destructive reuse of environments will greatly decrease the frequency of the already rare stack over ows. 5 Analysis The three possibilities are: put all environments in the heap, move the arguments to tail-recursive calls, or garbage-collect the stack. What are the relative costs? For brevity, I will call the three disciplines heap-envs, move-args, and stack-gc. A fourth option, not to use a stack at all, will be refered to as no-stack. Cost of making continuations Continuations are identical in the three stack disciplines, with one minor exception. In heap-envs the stack contains only continuations, since the environments are all in the heap. With no environments to interfere, the current continuation will always be found on the top of the stack. The *CONT* register can be eliminated, and no longer needs to be saved in continuations. This saves some work when continuations are created, and again when procedures return. Continuations have an additional cost for no-stack because they require copying values into and out of the heap. Avoiding this copying would require 7 signi gantly more complex preprocessing of the user's program. Cost of procedure calls Here I need to distinguish between two possible implementations of heap-envs. The sequence for creating a new environment for a call may be either (method A) 1. Create a new heap environment 2. Put the arguments in the environment as they are created or (method B) 1. Push all the arguments on the stack 2. Create a new heap environment 3. Copy the arguments from the stack to the environment Method A does not require copying each argument, but it does require a new register to hold the new environment while the arguments are being evaluated. This register must be saved in continuations. In a sense it replaces the *CONT* register. The other methods need the *CONT* register because an environment may be on the top of the stack, while Method A requires a *NEW-ENV* register because the new environment cannot be on top of the stack. Method A does not work for Scheme. In Scheme, use of call-withcurrent-continuation may cause procedure calls to return more than once. If the environment were created before the arguments were generated, and one of the arguments was produced by a call that returned twice, there would be no place to put the second version of the argument. It cannot replace the rst value in the environment as both values may be needed by the program. With Method B, the code that created the environment for the call would be executed twice, creating two di erent environments. no-stack and heap-envs create all environments in the heap, with method B of heap-envs also requiring that every argument be moved once. move-args and stack-gc create all environments on the stack, with moveargs requiring that every argument to a tail recursive call be moved once. For move-args and stack-gc, when an environment needs to be preserved for future use, that environment must be copied into the heap. In no-stack and heap-envs this is unnecessary. Garbage collection costs heap-envs puts environments in the heap that the other methods leave on the stack. no-stack puts everything in the heap. These will result in more frequent heap garbage collections. The stack-gc method will cause occasional stack garbage collections. These may either be in place, or operate by copying all of the currently live 8 bubblesort quicksort tak tak2 cpstak cpstak2 b towers matrix-mult matrix-mult2 length no heap move stack stack env args gc Description 2.44 1.97 2.94 1.73 3.30 2.07 2.82 1.90 4.46 2.08 1.05 2.15 1.74 2.38 1.44 2.93 1.97 2.40 1.57 3.91 2.06 1.01 1.99 1.70 2.00 1.24 3.10 2.31 2.15 1.31 3.43 2.03 0.91 1.83 1.64 1.98 1.20 2.92 2.14 2.15 1.29 3.36 1.99 0.87 bubble sort on 200 integers quicksort on 1500 integers Takeuchi " with inlined call to not Takeuchi in CPS " with inlined call to not Recursive Fibonnaci Towers of Hanoi Integer matrix multiply " with inlined procedures Length of a list Figure 3: Benchmark times in seconds running a DEC5000 workstation data on the stack into the heap. The second method is equivalent to preserving the current continuation for later use. This is exactly what is required to implement the Scheme procedure call-with-current-continuation, so for Scheme the stack garbage collector does not involve any additional code. The bottom line There is no analytical bottom line. The relative costs of the di erent disciplines depend on the behavior of the executed code. If the stack never over ows and no environments need to be preserved, then stack-gc is clearly the fastest, as it would not copy any arguments or have any other overhead. If every environment needed to be preserved, then heap-envs would be fastest, as it would not have to copy any arguments (for the rst method). move-args will likely be slower than stack-gc, because of the additional argument copying it involves, but it is simpler than stack-gc, and might be worthwhile if the di erence in speed were not much. Unless the code makes frequent use of call-with-current-continuation, no-stack is likely to be slow due to increased garbage collection overhead and higher costs for continuations. 6 Experimental Results I have implemented the three stack disciplines, along with a heap-only implementation, in Scheme48, which is based on a byte-code interpreter. A very simple compiler, which performs almost no optimizations, is used to compile Scheme programs into byte-codes. Scheme48 runs these benchmarks at 35% of the speed of Chez Scheme, a commercial Scheme implementation 9 Conts # words bubblesort 50 201 quicksort 32 130 tak 111 493 tak2 48 239 cpstak 64 254 cpstak2 0 0 b 93 417 towers 57 254 matrix-mult 85 477 matrix-mult2 1 6 length 0 0 Envs # words 111 263 102 243 127 382 64 254 175 541 111 413 93 185 90 254 143 509 31 90 50 150 Closed Tail # words # words 1 1 41 41 22 50 54 61 0 0 16 48 0 0 16 48 48 143 111 302 48 143 111 302 0 0 0 0 0 0 8 25 3 6 58 86 3 6 29 57 0 0 50 100 Figure 4: The number of and total number of words used in environments, closed-over environments, and tail-recursive calls while running the benchmarks; all numbers are in 1000s that uses a native code compiler (this number is somewhat unfair as the two systems were not running in identical modes; it does show that Scheme48 is not unreasonably slow). The results of running several benchmarks are shown in gures 3 and 4. With the exception of length and cpstak, the benchmarks were taken from [Kranz 88] and modi ed to run in Scheme instead of T. length is a simple loop in which all of the calls are tail recursive. cpstak was written by Will Clinger. The benchmarks were run with a heap size of one megabyte and a 40,000 byte stack. All methods required heap garbage collections for the two cpstak benchmarks. Only no-stack and heap-envs caused heap garbage collections for the other benchmarks. These garbage collections accounted for no more than ve percent of the time used, and usually much less. Live data accounted for less than two percent of the total heap space. With more live data or a slower garbage collector the relative runtimes could be very di erent. For example, in running bubblesort, no-stack used a total of 2.5 megabytes of heap space, while move-args and stack-gc used 8K bytes. Only cpstak and cpstak2 caused stack-gc to over ow the stack. The statistics require a little explanation. A word is the space required to store one pointer. The word totals for continuations, environments and closed-over environments do not include the space needed to store the sizes of the objects. The word totals for environments do include the space taken up by each environment's pointer to its lexically superior environment. The 10 number of environments is equal to the number of calls with at least one argument. For example, the gures for the length benchmark show that 50,000 calls were made, each of which passed two arguments (two arguments plus the pointer to the superior gives 150,000 words). There were 50,000 tail-recursive calls, so every call was tail recursive, and indeed had two arguments, as the tail-recursive calls totalled 100,000 arguments. Only on cpstak2 is stack-gc slower than another method. moveargs is almost as fast, running only three percent slower. heap-envs is eleven percent slower than stack-gc. Not using a stack at all is twentysix percent slower than stack-gc. These di erences make sense given the numbers in gure 4. Very few of the environments need to be preserved (the `Closed' column), and the majority of arguments are to non-tail-recursive calls. heap-envs was expected to prevail only if many environments had to be preserved, while stack-gc and move-args are identical except for move-args copying the arguments to tail-recursive calls. With relatively few tail-recursive calls and almost no preserved environments, heap-envs loses and stack-gc and move-args tie. 7 Comparison with other work The methods of implementing tail recursion with a stack that I have described here have been discussed in [Hanson 90, McDermott 80, Appel 87, Kranz 88], usually in the context of a native-code compiler. Interpreters require a di erent set of priorities. Simple translation and preserving debugging information are much more important than with native-code compilers. The inherent ineciency of interpretation a ects the relative eciencies of di erent techniques. For example, much of the concern in [Hanson 90] and [Kranz 88] is with trying to get the maximum number of environments on the stack. In the benchmarks in this paper, the e ects of not putting any environments on the stack are not all that great. In [Clinger 88] di erent stack implementations are compared in terms of implementing continuations with unlimited extent. In terms of implementing explicit continutions, the no-stack version here corresponds to their `garbage collection strategy', and the others are all variations on their `incremental stack/heap strategy'. None of the benchmarks in this paper made use of explicit continuations. 8 Conclusion The above results are disappointing. The original title of this paper was Stack Garbage Collection, and for most code the stack garbage collector does greatly reduce the amount of data that the interpreter copies when performing procedure calls. Unfortunately the benchmarks show that this 11 advantage is overshadowed by the relative ineciency of the Scheme48's byte-code interpreter. When compiled for a MIPS microprocessor, which uses a RISC architecture, byte-code dispatch takes ve machine instructions, and the 17 bytecodes executed in the loop of the length benchmark average 8.3 machine instructions apiece. The argument copying that accounts for the di erences is done by tight loops, using only a few instructions for every value copied. One iteration of the tail-recursive loop in length, which involves one procedure call with two arguments, requires around 180 instructions. The additional overhead, with heap-envs or move-args, of moving the two arguments does not matter enormously. On the positive side, the benchmarks do show the advantage of using some kind of stack. For Scheme, the presence of call-with-currentcontinuation requires code to move the stack into the heap, and that same code can be used for implementing proper tail recursion. For other languages, move-args is probably the best choice: it is nearly as fast, requires little code to implement, and does not put the burden on the heap garbage collector that move-args does. Acknowledgements Scheme48 was written by Jonathan Rees and myself. Jonathan Rees and Mitch Wand provided many helpful comments on earlier drafts of this paper. 12 References [Aho 86] Alfred V. Aho, Ravi Sethi, and Je ery D. Ullman. Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986. [Appel 87] Andrew W. Appel. Garbage collection can be faster than stack allocation. In Information Processing Letters 25, 1987. [Clinger 88] William D Clinger, Anne H Hartheimer, and Eric M Ost. Implementation strategies for continuations. In Proceedings Record of the 1988 ACM Symposium on Lisp and Functional Programming, ACM, 1988. [Hanson 90] Chris Hanson. Ecient stack allocation for tail recursive languages. In Proceedings of the 1990 ACM Symposium on Lisp and Functional Programming, ACM, 1990. [Hudak 90] P. Hudak and P. Wadler (editors). Report on the programming language Haskell. Technical Report YALEU/DCS/TR-777, Department of Computer Science, Yale University, 1990. [Hieb 90] Robert Hieb, R. Kent Dybvig and Carl Bruggeman. Representing Control in the Presence of First-Class Continuations. In Proceddings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, ACM, 1990. [Kranz 88] David Andrew Kranz. ORBIT: An optimizing compiler for Scheme. Technical Report YALEU/DCS/TR-632, Department of Computer Science, Yale University, 1988. [McDermott 80] Drew McDermott. An ecient environment allocation scheme in an interpreter for a lexically-scoped Lisp. In Proceedings of the 1980 Lisp Conference, ACM, 1980 [Milner 88] R. Milner. A proposal for Standard ML. In Proceedings of the 1984 ACM Symposium on Lisp and Functional Programming, ACM, 1984. [Rees 86] Jonathan A. Rees and William Clinger, editors. Revised3 report on the algorithmic language Scheme. SIGPLAN Notices 21(12), pages 37{79, 1986. [Steele 77] Guy Lewis Steele Jr. Debunking the \expensive procedure call" myth. AI Memo 443, Arti cial Intellegence Laboratory, Massachusetts Institute of Technology, 1977. 13