Measuring kernel throughput on Blue Gene/P with the Plan 9 research operating system∗†

Ronald G. Minnich and John Floren, Sandia National Labs
Aki Nyrhinen, University of Helsinki, Department of Computer Science

October 12, 2009

Abstract

We have ported the Plan 9 research operating system to the IBM Blue Gene/L and /P series machines. Part of our research is to answer the following question: can a full-featured operating system like Plan 9 equal the performance of a lightweight kernel such as IBM's Compute Node Kernel (CNK)? We expected the CNK to provide better performance than Plan 9 in several areas; e.g., CNK supports 1 Mbyte pages while Plan 9 only uses 4096 byte pages. This page-size difference should, in turn, give CNK better performance for applications with highly non-local memory reference patterns.

It is critical that we be able to measure kernel performance at a fine grain. While such measurement facilities are taken for granted in the Unix world, to date Plan 9 has supported only a profiling interface, not a tracing interface. A profiling interface can provide gross information about "where the time is spent" across all processes using the machine; in essence, it discards all timing and call-tree information, providing only a summary. A tracing interface is able, for a given process, to show exactly what functions called each other, and when.

We have created a Plan 9 trace device, devtrace, which can be used to selectively trace functions and processes in Plan 9. Users can enable a range of functions to be traced and observe which of the functions are called, in what order, with what parameters, and with how much time (in CPU ticks) spent in each function. We have developed a set of tools for plotting this data to make the progression and timing of function calls clear. Since all Plan 9 file systems are user-level processes, it is possible to trace a single process's file I/O as it progresses from the process, through the file server processes, and to disk. This measurement, in turn, allows us to propose changes in the Plan 9 kernel design and implementation to improve performance. We have used this device to quantify the advantage of our 1 Mbyte page implementation in Plan 9. We are further using it to optimize the I/O path from process to network.

The implementation of the trace device went through several distinct phases. In the end, we arrived at a device with a textual interface. Users need not write programs to use the trace facility. The trace device does not rewrite kernel code and hence does not require privileged access (as tracing does in Linux or Solaris). Any user of a Plan 9 system can easily measure the system's performance.

∗ Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DEAC0494AL85000. SAND-2008-4108-P.
† This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DEAC02-06CH11357.

1 Introduction

This project started out as a simple question: where is the time going in Plan 9, and why? Our primary use of Plan 9 is in High Performance Computing (HPC) systems, in which overall throughput can depend on very small overheads that do not much matter in desktop systems. Isolating such overheads and removing them is a very common activity in HPC. Plan 9 is a very small, tightly crafted operating system.
It has only 40 or so system calls. A given file system I/O call will result in a call stack that is only a few levels deep, as opposed to the (literally) dozens of call levels found in, e.g., a Linux NFS I/O. Modifying Plan 9 system calls in simple ways is a far less daunting task than on other operating systems. Also, Plan 9 is very modular; the boundaries between components are well-defined and adhered to, with very little of the shared state that characterises most operating systems. This separation enables the inclusion of changes as long as they do not break the interfaces. Hence, it is very likely that, given the discovery of a major overhead that might be avoided by a straightforward redesign, the redesign can be incorporated in the kernel.

To give some flavor of what the trace device allows, we show a real trace in Figure 1. The data and plot were created using devtrace and a script processing pipeline. The trace shows the kernel functions called by an 'echo' command. The X axis is the time in processor ticks; the Y axis is each kernel function, by name. For readability, we have filtered out all kernel functions that take less than 50,000 ticks. Each red line runs from the time a function is entered to the time it exits. The calling hierarchy can be determined by seeing which lines overlap other lines. For example, we can see that the syspwrite function calls pwrite, as they cover the same part of the X axis, and syspwrite is wider. Functions called multiple times would appear multiple times on this graph (they have been filtered out for readability).

[Plot omitted: one row per kernel function (syscall, syspwrite, write, qwrite, fault, fixfault, trap, memmove, mallocz, and others) against time in processor ticks.]
Figure 1: A sample trace output with tracedev. The X axis is in units of processor ticks.

Stride   CNK            Plan 9          Plan 9 (1 MB page)
1        3.7539700e-01  3.8000000e-01   3.7000000e-01
101      3.7721600e-01  1.6400000e+00   3.8000000e-01
201      3.7926300e-01  3.4000000e+00   3.8000000e-01
301      3.8119100e-01  4.9100000e+00   3.8000000e-01
401      3.8263600e-01  6.3300000e+00   3.8000000e-01
501      3.8444400e-01  7.7600000e+00   3.9000000e-01
601      3.8661300e-01  7.7700000e+00   3.9000000e-01
701      3.8841900e-01  7.7600000e+00   3.9000000e-01
801      3.8986500e-01  7.7600000e+00   3.9000000e-01
901      3.9227500e-01  7.7700000e+00   3.9000000e-01
1001     3.9444300e-01  1.1750000e+01   4.0000000e-01
Figure 2: Comparative performance of: CNK; Plan 9 with 4K pages; Plan 9 with 1 MB pages. Even with tracing on, Plan 9 is sometimes faster.

As with most Plan 9 devices, the interface to devtrace is textual. There are two files: tracectl and trace. The tracectl file is used both to query the trace device and to control its actions; the trace file is used to read trace records.

Example usage

We wanted to measure the performance of SAXPY on the Blue Gene/P (BG/P) system at Argonne, on both the CNK and Plan 9 kernels. We used the strid3 benchmark from Lawrence Livermore National Labs. Strid3 does multiple SAXPY operations, with different strides each time.
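To make the memory access pattern concrete, the heart of a strided SAXPY is a loop of the following shape (an illustrative sketch only, not the LLNL strid3 source; the function name and parameters are ours):

/* Illustrative strided SAXPY: y[i] = a*x[i] + y[i], visiting every stride-th
 * element of the arrays. At large strides nearly every reference touches a
 * new cache line and, with 4096-byte pages, frequently a new TLB entry. */
void
saxpy_strided(long n, long stride, float a, float *x, float *y)
{
	long i;

	for(i = 0; i < n; i += stride)
		y[i] = a*x[i] + y[i];
}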
The benchmark “is designed to severely stress the memory subsystem on a node ...utilize[s] combinations of loops of scalar and vector operations and measure[s] the MFLOP rate delivered as a function of the memory access patterns and length of vector utilized”. Strid3 was of particular interest as it can expose problems with both TLB and cache because, at larger strides, it can invalidate both a TLB entry and a cache line for each memory reference.

Relative performance is shown in Figure 2; we use the time trace output from strid3. The initial results were not encouraging: as the strides grew larger, runs under Plan 9 showed a factor of 30 penalty over CNK. We suspected that the problem was our use of 4096-byte TLB entries as opposed to the CNK's use of 1 Mbyte TLB entries.

To get some idea of the number of faults for strid3, we counted them. To count the faults we had to set up a trace for the faultpower function in the kernel, which is always called on a TLB miss. Using nm we obtain the address range of faultpower and execute the following command:

cpu% echo trace 0xf00049f4 0xf0004b38 new faults > /dev/tracectl

This command creates a new trace, called faults, which covers the range selected by the two numbers. Traces are not enabled by default. We can create any number of traces, and have them enabled and disabled independently. To enable a trace:

echo trace faults on > /dev/tracectl

We also have to enable the trace device itself, with the start command:

echo start > /dev/tracectl

This setup is very similar to a logic analyzer setup. We create the triggers, enable some of them, then start the logic analyzer. After the strid3 run, we can see how many page faults occurred:

cat /dev/tracectl

which shows:

logsize 16
trace f0004944 f0004a88 new fault
#trace f0004944 traced? f16a33d0
trace fault on
#tracehits 863700, in queue 863700
#tracelog f1909d90
#traceactive 0
#slothits 293803185
#traceinhits 131995046
watch 0

This output, as for most Plan 9 devices, consists of a valid set of commands and comments (comments start with a #). The output can be used as input to the device. It is very easy to save the output and, later, rerun the trace. We discuss the commands later in this paper; for now, the main item of interest is 'traceinhits', i.e. the number of times we had a hit on an active trace. It shows that strid3 took 132 million TLB miss faults. It is certainly reasonable to guess that we might improve performance by reducing the number of faults. To do so, we need to implement 1 Mbyte TLB entries, a.k.a. “huge pages”. The way in which we implement this change is grist for another paper, but we can say that the careful addition of 12 lines of code is sufficient to add them. We perform the run again, with much better results. We can get immediate confirmation of this improvement from the trace output:

logsize 13
trace f00049f4 f0004b38 new fault
#trace f00049f4 traced? f169d4f0
trace fault on
#tracehits 9918, in queue 9917
#tracelog f1894440
#traceactive 0
#slothits 18639
#traceinhits 9535
watch 0

E ffffffff8011aa68 000088297eecd2e7 000000000000023c 00000000000007be ffffffff81accb78 ffffffff81a4d2a8 ffffffff0000000a
X ffffffff8011aac7 000088297eecd445 000000000000023c ffffffff801abc78 0000000000000000 0000000000000000 0000000000000000
Figure 3: Data output from /dev/trace

We can see a 10,000-fold reduction in the number of calls to faultpower. The dramatic improvement in strid3 can be traced to this reduction.
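Because the ctl output is plain text, pulling a counter such as traceinhits into a script or program requires no special support. The following small Plan 9 C program is a hypothetical helper, not part of devtrace; it simply scans the status text for the field:

#include <u.h>
#include <libc.h>

/* Hypothetical helper: print the #traceinhits line from the textual
 * status that /dev/tracectl returns on read. */
void
main(void)
{
	int fd, n;
	char buf[8192], *p, *e;

	fd = open("/dev/tracectl", OREAD);
	if(fd < 0)
		sysfatal("open /dev/tracectl: %r");
	n = read(fd, buf, sizeof buf - 1);
	if(n <= 0)
		sysfatal("read /dev/tracectl: %r");
	buf[n] = 0;
	p = strstr(buf, "#traceinhits");
	if(p == nil)
		sysfatal("no #traceinhits field");
	if((e = strchr(p, '\n')) != nil)
		*e = 0;
	print("%s\n", p);	/* e.g. "#traceinhits 131995046" */
	exits(nil);
}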
On an application-by-application basis, we are using devtrace to wring out the performance issues in Plan 9 so that this full-featured kernel can equal the performance of a lightweight kernel.

For this case, there was no need to look at the actual trace records themselves. We describe the format here in brief. The trace records are also text, which facilitates cross-platform processing of the data: the interface works across heterogeneous machines. To read the trace records, we type:

cat /dev/trace

and can see the records shown in Figure 3. Figure 3 shows two lines of output from the trace device. The format of each line is:

• E or X, indicating entry or exit
• The PC
• The processor time stamp counter, 64 bits
• The process ID
• For E records, the first four arguments (if the function takes fewer than four arguments, the extra values are invalid and should be ignored); for X records, the return value

The rest of this paper is structured as follows: first, we provide an overview of related work; then we describe the current implementation and the postprocessing tools that we have developed to analyze the traces. Finally, we close with a description of future work.

2 Related work

There have been many implementations of kernel tracing facilities over the years. Trace systems generally have one of three goals: debugging, which requires a tremendous amount of flexibility; tracing calls and times, which requires recording function invocation, arguments, and timing; and profiling, which requires only recording the function start address and the time spent in that function, accumulated in a histogram (i.e. times for individual function calls are lost; only the total time spent in each function is recorded).

There are common techniques for implementing a trace system, namely:

1. rewrite-based (a.k.a. self-modifying code). A program or driver sets a breakpoint or a jump by actually rewriting part of the kernel text. The kernel manages the breakpoint or jump by calling a specified function. The function can be pre-defined and the same for all traces, or arbitrary and specified when the trace is created.

2. code-based. Programmers insert a call to a logging function at various places in the code at compile time.

3. automated. This technique is a variation on code-based. As part of the kernel build process, the compiler or linker generates code that calls a function for each function entry and exit.

4. hardware. Code is instrumented by hardware using special-purpose attached I/O devices or logic analyzers (3). This technique often requires some form of code-based support to write tags to the hardware at the proper points.

2.1 Rewrite-based

Rewrite-based tracing replaces a portion of the kernel code with code that calls a handler. The trace setup code allocates a buffer, saves the overwritten code in it, and installs a jump to the code buffer in place of the overwritten code. The buffer contains the overwritten code, a call to the trace support code, and code that resumes the function. A corresponding code buffer is created for function exit. The handler can be a user-provided function, sometimes called “trigger code”. Typically, trigger code logs the event and variables of interest, which are almost always the parameters to the function. The code that is written over can be replaced with a breakpoint instruction or a jump instruction. Kprobes (2), in the Linux kernel, is breakpoint-based; djprobes is jump-based – although as part of the installation process, djprobes begins by inserting breakpoints.
Breakpoints have the advantage of being small, usually one byte; they have the disadvantage of high cost in time, since a breakpoint requires interrupt handling. In some of these systems, a pre-written function is called; in others, users tracing kernel code must write a complete kernel module containing the trigger code functions. The trigger code can be used for more than one traced function, but must be able to disambiguate the multiple callers, i.e., function entry versus exit, or different functions. In kprobes, djprobes, and many other systems, the user must write a module that explicitly names the functions to be traced when the trace module is compiled. Further, users wishing to export the data from the trigger code to user mode must write additional code to move the data to another kernel subsystem (e.g. relayfs, now known as relay), from which the user program can extract it. Kprobes supports the tracing, but not the data transport.

As might be expected, supporting self-modifying code is a very complicated process when all the possible execution paths are taken into account. Multiprocessor machines add a further host of difficulties. The code may be rewritten on one processor, but we have no guarantee of when or if other processors will see the changes – or, still worse, they may see some but not all of the rewritten memory, as pointed out in (4). Further, relocating code in order to redirect it raises the question of whether execution of an arbitrary relocated code fragment will ever terminate: is it ever safe to remove a probe?

2.1.1 dtrace

This section would be incomplete without a reference to Sun's dtrace (1), possibly the most sophisticated rewrite-based system, and certainly the standard by which all other kernel trace tools are measured. Dtrace is a debugging-oriented tool, and hence has a great deal of flexibility. Dtrace can set a probe point at any of tens of thousands of places in the Solaris kernel. The trace points can remain installed at all times, since their cost when not activated is zero. Unlike most other Linux or Unix trace systems, dtrace provides a rich support system for naming probes, acquiring the data created when probes are triggered, and processing the data to simplify analysis. Dtrace supports both so-called “static tracing”, essentially a code-based tracing mechanism, and a “function boundary” tracing mechanism, implemented with the same technique as DKM: relying on the fact that function entry and exit have a characteristic set of location-independent instructions. Dtrace replaces one instruction with a TRAP instruction and, rather than executing the overwritten code, emulates it in software. Consequently, the cost of executing dtrace trigger code is fairly high. Dtrace shares the problem of most code rewriting strategies, in that the kernel code can be in an invalid state while the rewrite is being done.

2.1.2 Performance issues with code rewrite tracing

As we can see, code rewriting is complex and has a number of non-obvious costs:

• making sure no processors are executing code that is being rewritten
• making sure all processors see the changes once they are finished
• executing the code that has been moved
• managing the problems that can occur when arbitrary code has been moved to another location.

One additional concern for performance measurement, which is not immediately obvious, is the differential cost of executing a function when it is traced.
Consider the case when non-traced functions, in a tracing-enabled kernel, see no performance penalty. Those functions that are traced will appear to have a much higher comparative cost than they do in reality. If we compare, e.g., a program that uses only non-traced functions against one that uses traced functions, we will artificially inflate the apparent cost of the latter. In contrast, in the Plan 9 trace device, the time to run all functions is uniformly increased whether they are traced or not. As a result, there is a closer correspondence between the times to execute two functions, and hence two programs which use those functions, even if one is set up to be traced and one is not. In short, a zero penalty for non-traced functions could lead users to attribute a higher time cost to traced functions than is in fact the case.

2.2 Code-based

Code-based systems have been around for some time. A more recent version of code-based tracing is Linux kernel markers. Programmers insert “markers” at points of interest in the kernel source, e.g.:

trace_mark(blk_request, "count is %d", count);

The markers are disabled by default. They are enabled by calling a registration function that names the marker and supplies a callback function pointer and a pointer to private data. The callback will be invoked when the marker is hit, with the private data passed to it. In order for kernel markers to become generally useful, large parts of the kernel – all 50 Mbytes of it – need to have kernel markers added. Adding this additional code to the kernel is a major effort and will take some time. As of 2.6.25, only four markers have been created.

2.3 Automated trace

In an automated trace system, the compiler or linker generates the trace support code via a build-time command. An example is the Plan 9 kernel profiling facility, which is invoked from the Plan 9 linker. The linker in Plan 9 is capable of inserting code or optimizing code away – in fact, it shares code generation responsibilities with the compiler. When it is invoked with a -p switch, the Plan 9 linker inserts a call to profin at the entry point of each function, and a call to profout at each exit. The profiling library is also linked in as part of this process. The only information passed to and used in the profiling functions is the program counter. The counter is used to create a histogram of time spent in functions. Relationships between functions, and times for particular calls, are not collected.

2.4 Data extraction and analysis

As mentioned above, tracing is only part of the problem. Once the trace function has been activated, it must produce information and deliver it to a consumer. The simplest consumer is the kernel system log. Data is produced for the log by a print function. The bandwidth provided by the log, and the performance impact of using it, are such that it is rarely used: it is very easy to print so much data that the log is overrun and data is lost. Instead, the trace facility can provide a path for delivering data to user-level consumers, as in dtrace; or the trace facility might require that users set up the means by which data is delivered to consumers, as in kprobes, djprobes, and other systems.

Another issue concerns the format of the data. Kprobes, djprobes, and kernel markers all allow unrestricted creation of data streams, both in content and record size.
While a lack of restrictions might seem desirable, it can be difficult for programs to parse all the possible variations of data output – this same problem has been seen and documented for, e.g., /proc (5). The four markers present in 2.6.25 have these formats:

"name %s format %s"
"name %s format %s"
"ctx %p spu %p"
"ctx %p"

This problem has been dealt with before, and it is easy to solve: if the data size and content are arbitrary, then the data should be in a self-describing, self-contained format, e.g. s-expressions as defined in (5). A self-describing format has many advantages, not the least being that output from multiple sets of markers can easily be processed by a program which is only interested in a subset of the markers. Programs need not concern themselves with all possible marker formats, since the self-describing structure of the data makes it easy to skip markers that are not of interest. Data can be saved and resurrected years later, and the structure of the data is readily apparent.

Dtrace and our Plan 9 trace device (devtrace) opt for fixed-format, fixed-size records, for reasons of processing complexity and overhead. Devtrace also fixes the content of the data: a function entry/exit tag; a cycle counter (processor clock); the program counter; and the first four parameters (entry) or the return value (exit) of the function. Dtrace has a bit more flexibility but also has a fixed record size.

2.5 Summary

We have only touched upon a small fraction of the many available tracing facilities. Tracing facilities have been developed over at least the last four decades; we focus mainly on the Linux systems as they are the most likely to be familiar to the reader. The systems vary mainly in how the instrumentation is installed: some are dynamic, installed by code rewriting; others are written into the kernel source by programmers; still others are inserted automatically into the kernel by the compilation toolchain.

The system we have built for Plan 9 (devtrace) is based on the automatic approach. We build the kernel with profiling enabled but replace the normal profiling functions. Our trace functions allow users to selectively enable tracing of both individual functions and individual processes. We can trace file I/O calls from an editor to the file server and back. The control of which functions and processes to trace is accomplished by writing textual commands to a control file. We now describe this system in more detail.

3 The Plan 9 trace device

The Plan 9 trace device is an automated trace device that does not use code rewriting. To use it, programmers add the -p switch to the Plan 9 linker command for the kernel, and also link in two additional files: the C code for the trace device itself and the assembly code that implements and replaces the standard profin and profout functions. The assembly code is needed to ensure that the interposed profiling calls do not interfere with argument or return values. The profin and profout assembly code is limited to the minimal support needed on a per-architecture basis. The functions test to see if tracing is globally enabled and, if so, push the first four args (on entry) or the return value (on exit) and call C functions named tracein and traceout. The main modification from the standard functions is in the provision of this additional information. The C code implements the rest of the tracedev functionality and, again, provides a device interface for controlling the device, determining status, and reading the data.
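Conceptually, the entry-side logging amounts to appending a fixed-size record to a kernel ring buffer. The following sketch suggests what a tracein-style function does; it is not the devtrace source, and the record layout, the filter helpers (addrtraced, pidwatched, curpid), and the buffer name are our assumptions based on the record format and ctl fields described in this paper:

/* Conceptual sketch of entry-side logging (not the devtrace source).
 * A record mirrors the textual output: an entry/exit tag, the PC, the
 * processor cycle counter, the PID, and four argument words. */
typedef struct Tracerec Tracerec;
struct Tracerec {
	char	type;		/* 'E' on entry, 'X' on exit */
	uintptr	pc;		/* program counter of the traced function */
	uvlong	ticks;		/* processor time stamp / cycle counter */
	ulong	pid;		/* process id of the current process */
	uintptr	arg[4];		/* first four args (entry) or return value (exit) */
};

enum { Nlog = 1<<16 };			/* 2^logsize records; 16 matches the example above */
static Tracerec tracelog[Nlog];		/* ring buffer drained by reads of /dev/trace */
static ulong whead;			/* next slot to fill */

extern int traceactive;			/* toggled by the start and stop ctl commands */
extern int addrtraced(uintptr pc);	/* assumed helper: is pc inside an enabled trace? */
extern int pidwatched(ulong pid);	/* assumed helper: is pid named in a watch command? */
extern ulong curpid(void);		/* assumed: current process id (up->pid in the kernel) */
extern void cycles(uvlong*);		/* kernel cycle-counter read */

void
tracein(uintptr pc, uintptr a0, uintptr a1, uintptr a2, uintptr a3)
{
	Tracerec *r;

	if(!traceactive || !addrtraced(pc) || !pidwatched(curpid()))
		return;			/* cheap filters; untraced calls pay only these tests */
	r = &tracelog[whead++ % Nlog];
	r->type = 'E';
	r->pc = pc;
	cycles(&r->ticks);
	r->pid = curpid();
	r->arg[0] = a0;
	r->arg[1] = a1;
	r->arg[2] = a2;
	r->arg[3] = a3;
}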
In the Plan 9 manner, the device supports two files: ctl and data.

Data file. The data file is read-only and returns trace records as text.

Ctl file. The ctl file, when read, returns information about the state of the device. As in many Plan 9 devices, strings read from the ctl file contain valid commands; the output of the ctl file can be saved and written back to the ctl file. The main use of the ctl file is to control tracing. Programs, scripts, or users echo commands into the file. The commands are shown in Table 1.

This set of operations allows tracing some or all of the kernel. We can restrict the set of functions traced, as well as the set of processes traced. We can follow a write system call from a process to the file system server, to its archival backup, and from there to the main disk drive. We can look at the sizes of writes and determine where the bottlenecks are in the I/O system. Most importantly, by design, the overhead is uniform – not low, but uniform, so that the cost of functions relative to each other is roughly the same, traced or untraced.

3.0.1 The difficulties of simplicity

As we have noted, the trace device interface and the device itself are small: the C code is 800 lines, or roughly 1/3 the size of the Linux kprobes code; the assembly is less than 50 lines. The device described is the result of several iterations, not just on Plan 9, but also on Linux. We started with a complex device that rewrote kernel code. The complexity of that code, comparable in scope to kprobes or djprobes, led us to look for a better and simpler (albeit less capable) design. Our work paid off: it took two hours to port the trace device from the AMD64 kernel to the BG/P kernel.
Command                        Description
trace start end new name       Creates a new trace named name covering the addresses from start to end. The trace is set up but not enabled; large batches of traces can be set up and enabled later.
trace name remove              Removes the trace named name.
trace name on                  Enables the trace named name.
trace name off                 Disables the trace named name.
size n                         Resizes the kernel-based trace buffer to 2^n records.
query address                  Determines if the given address is traced. Useful for testing.
testtracein                    Simulates entry into traced code. Useful for testing both the device and programs that process the data.
watch pid                      Enables tracing on a process id (PID)-specific basis and further enables tracing on that one PID only.
start                          Enables tracing globally (the flag is tested in the profin/profout assembly code).
stop                           Disables tracing globally (the flag is tested in the profin/profout assembly code).
Table 1: Trace device commands

tcfd = open("/dev/tracectl", ORDWR);	/* open the trace device */
tree = open("/dev/vc0net", ORDWR);	/* open the tree network */
m = smprint("watch %d", getpid());	/* command to watch ourselves */
write(tcfd, m, strlen(m));		/* write watch self command */
write(tcfd, "start", 5);		/* start tracing */
write(tree, "hi", 2);			/* do tree I/O */
write(tcfd, "stop", 4);			/* stop tracing */
Figure 4: The test program

One of our goals is to make the trace data easily accessible to users. Linux has done an impressive job in this area with tools such as SystemTap. But, in the end, the Linux tools are very complex systems that are put in place to control other very complex systems. Using the raw interface of kprobes is a daunting task, requiring a great deal of knowledge on the user's part. SystemTap eases the pain, but at the cost of comprehension when things do not go as planned. In contrast, tracedev follows the Plan 9 path of a simple, regular device interface that can be directly used – even from the shell or command line. We regularly control tracedev by echoing commands into the ctl file and using cat to read the data file. The power of this interface is hard to overstate. Any tool that can process textual data can pull data from the trace device and process it. A non-expert can easily use tracedev to monitor Plan 9.

4 Use cases

In this section we discuss the processing pipeline which produced the graph in Figure 1. We also show a quick analysis of I/O sizes for an interactive Plan 9 session and an answer to two questions: "What addresses are used when a process communicates with the kernel? Can we cache translations to speed up system calls?"

4.1 Example 2: how long does it take to get from a process to an I/O device?

We wish to know how long it takes a process to write to an I/O device, as we need to reduce this time to be competitive with MPI. The test program is shown in Figure 4. The simplicity should be apparent: open a control file and device; write commands to watch the process and then start the trace run; write to the tree; stop tracing. The results are shown in Figure 5. The write is currently not nearly fast enough; the writes hit the wire at the beginning of txstart, which on close inspection is almost 90,000 ticks in. That said, we do hit the treewrite function at only a few thousand ticks in; the opportunities for optimization are pretty obvious.
[Plot omitted: the kernel functions involved in the tree write (trap, syscall, syspwrite, write, treewrite, treenetsend, qbwrite, wakeup, txkick, txstart, and others) plotted against time in processor ticks.]
Figure 5: Results of watching tree network write

echo trace 17a099 17a0a2 new pr > /dev/tracectl
echo trace 17a2cf 17a2d8 new pw > /dev/tracectl
echo trace pr on > /dev/tracectl
echo trace pw on > /dev/tracectl
Figure 6: Commands for tracing pread and pwrite

4.2 Example 3: What are typical I/O sizes?

We wished to know what size I/O operations programs initiate. All I/O goes through the pread and pwrite system calls. To get a quick idea of what might be going on, we decided to monitor only the return values of these calls. We set up tracing as shown in Figure 6. The result is shown in Figure 7. Almost 90% of I/O operations are less than 1 Kbyte; 1/3 are under 8 bytes. An I/O enhancement strategy designed around these figures could greatly improve performance.

4.3 Example 4: What addresses are used when a process communicates with the kernel? Can we cache translations to speed up system calls?

We consider the case of processes communicating with the kernel. In earlier work, which showed us the need for the trace device, we determined that getting user data into and out of the kernel was a costly process. A large fraction of this time is spent mapping user-level pointers into the kernel address space so that copying can be done. We speculated that if we could cache frequently used mappings, we might improve performance, particularly for read and write. For this simple test, we measure tar cf /dev/null /sys/src.

Processes communicating with the kernel pass a user-mode virtual address to several system calls. It would be a bit of work to set up the 20 or so triggers, but fortunately there is a function in the kernel, okaddr, which is called to validate user addresses. Hence we can watch the parameters to the okaddr function. The success of a caching strategy is critically dependent on the number of different addresses used – if the number is small enough (e.g. 32), we can easily cache address mappings; if the number is too large (e.g. 16384), then it is unlikely that caching is practical. For this test, our program records the PID and the page address passed to okaddr. We only monitor function entry. The results shown in Table 2 indicate that we can cache as few as 32 page address translations for a process and eliminate much of the cost of both checking a virtual address and converting it to a physical address for kernel I/O.

Figure 7: I/O sizes for an interactive Plan 9 session on a network terminal. The X axis is I/O size and the Y axis is a cumulative count of the number of I/Os.
Of the total of 1400 I/Os, 1380 were less than 500 bytes.

PID   page address   count
102   4294967294     6
150   4294967294     6
176   15             63
176   16             4087
176   17             556
176   18             484
176   19             188
176   20             254
176   21             445
176   22             431
176   23             566
176   24             51
176   27             43
176   28             264
176   29             201
176   30             104
176   31             121
176   32             166
176   33             68
176   36             5
176   37             61
176   38             131
176   39             6
176   40             3
176   4294967294     14034
178   4294967292     3
178   4294967294     3
193   4294967294     3
199   33             3
199   34             2
199   4294967292     8
199   4294967294     6
51    1              3
51    10             3
51    18             1
51    19             2
51    38             44
51    4294967294     2
Table 2: Kernel page addresses used by tar, the file server, and other server processes in the tar pipeline

Profiling level                              User    System   Real    % penalty
None                                         2.94    12.07    15.76   0
Included but disabled                        3.87    16.34    20.24   28
Tracing I/O system calls                     5.23    30.08    35.32   74
profin and profout returning immediately     4.13    16.57    20.71   28
Table 3: Performance for various levels of tracing

5 Performance

The performance impact of any tracing system is always of concern. To test performance, we set up a command to copy data from /dev/zero to /dev/null, one megabyte at a time, in a kernel compiled without profiling, in one compiled with profiling but with tracing turned off, and then again on the profiling kernel with tracing enabled. A kernel was also compiled with profiling enabled and the assembly-level profin and profout functions executing RET (return) immediately upon entry. Table 3 shows the results.

There is a very clear difference in performance between each test. Simply using a kernel compiled with profiling support, with tracing disabled, resulted in a 28% increase in real time over the non-profiling kernel. Going from a profiling kernel with tracing disabled to a profiling kernel tracing a section of memory gave a 74% increase in real time. Interestingly, having profin and profout return immediately gives about the same 28% hit as using a profiling kernel with tracing disabled; this is likely due to the pipeline being cleared by the CALL instruction. We may need to further modify the linker to either inline the functions or find some other way to make profiling-disabled functions more efficient.

However, when the intended use of the trace device is taken into account, i.e. comparative measures of internal kernel function performance, these performance hits are not particularly problematic. Tracing is useful for gaining an idea of which functions take longer to execute. Since tracing creates an equal penalty for all traced functions, the ratios of execution times remain roughly the same. As the plot in Figure 1 indicates, useful information can be gleaned without even knowing the time scales involved – it is sufficient to simply look at the graphs and see the different lengths of function execution.

6 Performance optimization

The current design uses the Plan 9 linker to insert an always-executed call to _profin and _profout at entry and exit points. As noted, this adds a fixed cost of almost 28% to common kernel operations. Hence, we cannot ship a kernel with tracing always enabled, as Sun does. The question must be asked: could we change how we enable tracing? It turns out we can, if we are willing to consider using a rewrite-based approach. As it happens, we can make this approach efficient and SMP-safe, and avoid additional code buffers for saving and restoring function code. We can completely eliminate the "invalid kernel state" problem that the other code rewrite systems have, since the code change is only one byte and by definition cannot span cache lines.
The linker currently emits the following code at the entry of every function:

	CALL _profin(SB)

We can modify the linker to emit a slightly different sequence:

	BR .+7
	CALL _profin(SB)

The result is that the call to _profin never happens. To enable the call to _profin, one rewrites the branch (either before the kernel is booted, or from the trace device) as follows:

	BR .+2
	CALL _profin(SB)

The function return case is easy: instead of

	CALL _profout(SB)
	RET

we have the linker emit:

	RET
	CALL _profout(SB)
	RET

To trace-enable a return, we simply change the first RET to a NOP. The cost of the non-trace-enabled return is zero.

The only potential concern is the cost of the added branch on function entry: every function will have an added BR .+7 as the first instruction. This seems like it ought to slow things down. As it turns out, the penalty is not nearly as bad as we might expect. Initial benchmarks showed encouraging results. We modified the Plan 9 loader to emit this sequence by default for profiled kernels, and a number of use-based benchmarks showed a 3 percent overhead, as low as 1.5 percent on an Opteron. In fact, we are using this kernel almost continuously now, as the performance impact is really not noticeable.

In our implementation of a code rewrite system, unlike the others described above (dtrace, kprobes, djprobes), the kernel code is never in an invalid state: it transitions from one valid state to another. There is no need for an external code buffer, for multiprocessor synchronization as the probes are installed and removed, or for all the other complex overhead of the other rewrite systems. This design represents a substantial improvement over other trace devices, combining the best attributes of most of them: minimal overhead when not enabled; no invalid kernel state; and low cost for inserting a probe.

7 Conclusions and future work

We have shown a kernel device for measuring Plan 9 kernel overhead on the IBM BG/P supercomputer. The device, devtrace, follows the Plan 9 model of providing a simple, textual control interface that requires no C code or even programming on the user's part. It differs from other efforts in that it does not use complexity to hide complexity; rather, it is a very simple device.

We showed two possible implementations. The first requires no self-modifying code, unlike many other trace devices; it does exact a high performance penalty, however. The second implementation exacts a very low penalty. The second does require self-modifying code, but not the unsafe self-modifying code used in, e.g., kprobes or djprobes: the kernel code never makes a transition from valid to invalid to valid, but rather only makes a transition between two valid states.

We showed a number of uses of the trace device, including system call latency measurement, overhead measurement, and I/O size measurement. Finally, we showed that the kernel could implement a cache for virtual addresses that would be effective with as few as 32 entries. The I/O size measurements and the virtual address measurement point to ways to greatly improve I/O performance while maintaining the Plan 9 I/O model.

This work has relevance to other operating systems as well. We could modify other compilers, such as gcc, to emit the performance-optimized trace calls shown above. It would be easy to have gcc generate the instrumentation for a Linux kernel. In some ways the gcc work would be easier; we would not need to write the assembly code interface, but could, rather, have gcc generate it.
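In fact, gcc already offers a hook of this kind: compiling with -finstrument-functions makes every instrumented function call __cyg_profile_func_enter and __cyg_profile_func_exit at entry and exit, which could play the same role as _profin and _profout. The following is a minimal user-space sketch of such hooks, an illustration of the existing gcc mechanism rather than a port of devtrace:

/* Sketch of the hooks gcc calls when code is compiled with
 * -finstrument-functions; each instrumented function calls these on
 * entry and exit. The hooks themselves must not be instrumented. */
#include <stdio.h>

__attribute__((no_instrument_function))
void
__cyg_profile_func_enter(void *fn, void *callsite)
{
	fprintf(stderr, "E %p %p\n", fn, callsite);
}

__attribute__((no_instrument_function))
void
__cyg_profile_func_exit(void *fn, void *callsite)
{
	fprintf(stderr, "X %p %p\n", fn, callsite);
}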
Future work includes improving the display of results, as well as providing a more automated interface.

References

[1] B. M. Cantrill, M. W. Shapiro, and A. H. Leventhal, Dynamic instrumentation of production systems, 2004, pp. 15–28.

[2] R. Krishnakumar, Kernel korner: kprobes, a kernel debugger, Linux Journal 2005 (2005), no. 133.

[3] Andrew McRae, Hardware profiling of kernels, or: How to look under the hood while the engine is running, 1993.

[4] Satoshi Oshima, Djprobes status.

[5] Matthew J. Sottile and Ronald G. Minnich, Supermon: A high-speed cluster monitoring system, CLUSTER '02: Proceedings of the IEEE International Conference on Cluster Computing (Washington, DC, USA), IEEE Computer Society, 2002, p. 39.