KvmFS: Virtual Machine Partitioning For Clusters and Grids

Andrey Mirtchovski, Los Alamos National Laboratory, andrey@lanl.gov
Latchesar Ionkov, Los Alamos National Laboratory, lionkov@lanl.gov

Abstract

This paper describes KvmFS, a synthetic file system that can be used to control one or more KVM virtual machines running on a computer. KvmFS is designed to provide its functionality via an interface that can be exported to other machines for remote configuration and control. The goal of KvmFS is to allow a multi-CPU, multi-core computer to be partitioned externally in a fashion similar to today's computational nodes on a cluster. KvmFS is implemented as a file server using the 9P protocol, and its main daemon can be mounted locally via the v9fs kernel module. Communication with KvmFS occurs through standard TCP sockets. Virtual machines are controlled via commands written to KvmFS' files; status information about KVM virtual machines is obtained by reading those files back. KvmFS allows us to build clusters in which more than one application can share the same SMP/multi-core node, with minimal full-system images tailored specifically for each application.

1 Introduction

The tendency in high-performance computing is towards building processors with many computational units, or cores, with the goal of parallelizing computation so that many units are performing work at the same time. Dual- and quad-core processors are already on the market, and manufacturers are hinting at 8, 32, or even 80 cores for a single CPU, with a single computational node composed of two, four, or more CPUs. This will result in applications running and contending for resources on large symmetric multiprocessor systems (SMPs) composed of hundreds of computational units.

There is a problem with this configuration, however: since clusters are currently the dominant form of node organization in the HPC world (as seen in the latest breakdown by machine type on the Top 500 list of supercomputers), most applications are designed either to run on a single 2- or 4-core machine, or so that separate parts of the program run on separate 2- or 4-core nodes and communicate via some message-passing framework such as MPI. As a result, most of the applications running here at Los Alamos National Laboratory scale to at most 8 CPUs on a single node. Furthermore, many applications assume that they are the only ones running on a node and will not have to contend for resources. To satisfy the requirements of such applications, large SMP computers will have to be partitioned so that applications are ensured dedicated resources without contention. Fail-over and resilience, two very active topics in high-performance computing, also require the means to transfer an application from one machine to another in the case of hardware or software component failures on the original computer.

One solution for partitioning hardware and providing resilience has gained widespread adoption and is considered feasible for the HPC world: virtualization using hypervisors. Borrowed from the mainframe world, it allows separate instances of an operating system (or indeed separate operating systems) to run on the same hardware or parts thereof. The two major CPU manufacturers have added support for virtualization to their newest offerings, allowing hypervisors to reach performance levels that software-only virtualization could not achieve.
KVM has recently emerged as a fast and reliable (with the hardware support available on modern processors) subsystem for virtualizing the hardware of a computer. Our goal with KvmFS is to enable KVM to be remotely controlled by either system operators or schedulers, and to allow it to be used for partitioning on clusters composed of large SMP machines, such as the ones already being proposed here at LANL.

Virtualization benefits the system administrator as well as programmers and scientists running high-performance code on large clusters. One benefit is the full control over the operating system installation that an application requires. For example, it is not necessary to have all support libraries and software installed on all machines of a cluster; instead, the application is run in an OS instance that already contains everything it requires. This greatly simplifies installations in cases where conflicting libraries and support software may be required by different applications. Another possibility is to run a completely different operating system under virtualization, something impossible in current monolithic cluster environments.

With a fast and reliable means of running applications on their own slice of an SMP, it is convenient to be able to extend the control of partitioning and virtualization across the cluster to a control or head node. This is the niche that KvmFS fills: it provides a fast and secure means to control VMs across a cluster, or indeed a grid environment.

1.1 KVM

KVM [14] is a hypervisor support module in the Linux kernel which utilizes hardware-assisted x86 virtualization on modern processors that provide Intel's Virtualization Technology or AMD's Secure Virtual Machine extensions. By adding virtualization capabilities to a standard Linux kernel, KVM provides the benefits of the optimizations that exist in a standard kernel to virtualized programs, greatly increasing performance over "full hypervisors" such as Xen [1] or VMware [9]. Under the KVM model, every virtual machine is a regular Linux process scheduled by the standard Linux scheduler, and its memory is allocated by the Linux memory allocator. KVM works in conjunction with QEMU to deliver the processor's virtualization capabilities to the end user.

1.2 QEMU

QEMU [2] is a machine emulator which can run an unmodified target operating system (such as Windows or Linux) and all its applications in a virtual machine. QEMU runs on several host operating systems such as Linux, Windows, and Mac OS X. The primary usage of QEMU is to run one operating system on another, such as Windows on Linux or Linux on Windows. Another usage is debugging, because the virtual machine can be easily stopped, and its state can be inspected, saved, and restored. Moreover, specific embedded devices can be simulated by adding new machine descriptions and new emulated devices.

Although the host and target operating systems can be different, our software focuses on Linux as the host system, since Linux is the primary OS on all of our recent clusters at LANL and is widely adopted for HPC environments. Also, KVM currently exists only for the Linux kernel.
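To make concrete what KvmFS automates, the following is a minimal sketch of how a single KVM-accelerated guest would be started by hand on one node. The module names and QEMU flags are the common ones for KVM-era Linux systems; the binary name and disk image are illustrative and vary between distributions (the guest parameters mirror the sample session in Section 6):

    # Load the KVM modules (kvm-amd instead of kvm-intel on AMD hardware).
    modprobe kvm
    modprobe kvm-intel

    # Start a guest from a disk image with 512 MB of RAM and one NIC.
    # The KVM-enabled binary may be installed as qemu, qemu-kvm, or
    # qemu-system-x86_64 depending on the distribution.
    qemu -hda disk.img -m 512 \
         -net nic,macaddr=00:11:22:33:44:55 -net user

KvmFS performs the equivalent of these steps on the server side in response to commands written to a session's ctl file, so that the same configuration can be driven remotely over 9P rather than typed at each node.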
2 Design

KvmFS was created to allow its users to run and control virtual machines in a heterogeneous networked environment. As such, KvmFS was designed to fulfill the following tasks:

functionality: provide an interface that allows management of VMs on a cluster.

scalability: provide the ability for fast creation of multiple identical VMs on different nodes connected via a network.

checkpoint and restart: provide the ability to suspend virtual machines and resume their execution, potentially on a different node.

The design of KvmFS follows the well-established model of providing functionality in the form of synthetic file systems which clients operate on using standard I/O commands such as read and write. This method has proven successful in various operating systems descended from UNIX; the /proc [7] file system is a very well-established example. The Plan 9 operating system extends the concept further: it presents the network communication subsystem as mountable files [13], and even the graphics subsystem and the window manager written on top of it are exposed as file systems. Implementations such as these suggest that the concept is feasible and that providing interfaces to resources in the form of a file system, and exporting them to other machines, is a good way to quickly allow access to them from remote machines, especially since files are the single most commonly exported resource in a networked environment such as a cluster.

KvmFS is structured as a two-tiered file server to which clients connect either from the local machine or across the network. The file server allows them to copy image files and boot virtual machines using those image files. The file server also allows controlling running virtual machines (start, stop, freeze), as well as migrating them from one computer to another.

The top-level directory KvmFS serves contains two files providing information about the architecture of the machine as well as starting a new session for a new VM. Each session already started is presented as a numbered subdirectory. The subdirectory itself presents files which can be used to control the execution of the VM, as well as a subdirectory into which arbitrary image files can be copied for use by the VM. The KvmFS file system is presented in detail in Section 3.

3 The KVM File System

KvmFS presents a synthetic file system to its clients. The file system can be used for starting and controlling all aspects of the runtime of the virtual machines running on the machine on which kvmfs is running.

3.1 Top-level files

    clone
    arch
    vm#/

Arch is a read-only file; reading from it returns the architecture of the compute node in the format operating-system/processor-type.

Clone is a read-only file. When it is opened, KvmFS creates a new session and the corresponding session directory in the file system. Reading from the file returns the name of the session directory.

Vm# is a directory corresponding to a session created by a KvmFS client. Even though a session may not be running, vm# will exist as long as that client keeps the clone file open. If the virtual machine corresponding to a session is running, the clone file may be closed without causing the vm# directory to disappear.

3.2 Session-level files

    ctl
    info
    id
    fs/

These files are contained in the session directory which is created when a client opens the clone file of a KvmFS server.

Ctl is used to execute and control a session's main process. Reading from the file returns the main process pid if the process is running, and -1 otherwise. The operations on the session are performed by writing to it.

Reading from info returns the current memory and device configuration of the virtual machine. The format of the information is identical to the commands written to the ctl file.

Id is used to set and get the user-specified VM identifier.

The fs directory points to the temporary storage created for the virtual machine. The user can copy disk images and saved VM state files there, to be used in the VM configuration.
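As a brief illustration of the interface just described, the following sketch creates a session by holding the clone file open and then inspects the session files; the node name n1 and session number 0 follow the sample sessions in Section 6, and the identifier written to id is purely illustrative:

    # Mount a KvmFS server and create a session by keeping clone open;
    # reading clone returns the name of the new session directory (here 0).
    mount -t 9p n1 /mnt/9
    cd /mnt/9
    tail -f clone &

    # Inspect the new session.
    cd 0
    cat ctl         # pid of the VM process, or -1 while nothing is running
    cat info        # current memory and device configuration
    echo myvm > id  # set a user-specified identifier (name is illustrative)
    cat id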
4 KvmFS Commands

This section describes the set of commands available for controlling KvmFS virtual machines; they are issued by writing to a session's ctl file.

dev name image
Specifies the device image for a specific device. Name is one of hda, hdb, hdc, hdd. If image is not an absolute path, it should point to a file that has been copied into the fs directory. An optional boot parameter can be provided to specify that the device should be used to boot from.

net id mac
Creates a network device with ID id and MAC address mac.

loadvm file
Loads a saved VM state from file file. If file is not an absolute path, it should point to a file in the fs directory.

storevm file
Stores the state of the VM to file file. If file is not an absolute path, the file is created in the fs directory.

power on|off
Turns VM power on or off.

freeze
Suspends the execution of the VM.

unfreeze
Resumes the execution of the VM.

clone max-vms address-list
Creates copies of the VM on the nodes specified by address-list. Copies the content of the fs directory to the remote VMs and configures the same device configuration. If the virtual machine is already running, stores the current VM state (as in storevm) and loads it in the remote VMs. If max-vms is greater than zero and the number of specified sessions is larger than max-vms, clone pushes its content to at most max-vms sessions and issues clone commands to some of them, instructing them to clone themselves to the remaining VMs from the list.

The format of the address-list is:

    address-list = 1*(vm-address ',')
    vm-address   = node-name ['!' port] '/' vm-id
    node-name    = ANY
    port         = NUMBER
    vm-id        = ANY
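As a small sketch of how these commands combine, a running VM can be checkpointed into its fs directory and shut down as shown below; the mount point and session number follow the sample sessions in Section 6, and vmstate is an illustrative file name:

    cd /mnt/9/0
    echo freeze > ctl           # suspend execution
    echo storevm vmstate > ctl  # relative path: state is saved as fs/vmstate
    echo power off > ctl

The resulting fs/vmstate file can later be passed to loadvm, on the same node or, after copying, on a different one; this storevm/loadvm pair is also the mechanism the clone command uses when the source virtual machine is already running.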
5 Implementation

There are two ways of implementing access to programs or system resources as files in Linux: either using FUSE [3] or the 9P [8] protocol. We chose the 9P protocol because it is better suited for communicating with file systems over networks. 9P has also been in use for the past twenty years and is sufficiently hardened to handle various workloads in environments ranging from a single machine to thousands of cluster nodes [5]. Furthermore, our team is well familiarized with 9P through the implementation of V9FS, the kernel module allowing 9P servers to be mounted on a Linux file system [10] [4]. It is important to point out, however, that there is no significant barrier to implementing KvmFS using FUSE.

5.1 9P

Representing operating system resources as files is a relatively old concept exploited to some extent in the original UNIX operating system, but it matured extensively with the development and release of the "Plan 9 from Bell-Labs" operating system [12]. "Plan 9 from Bell-Labs" uses a simple, yet very powerful communication protocol to facilitate communication between different parts of the system. The protocol, named "9P" [8], allows heterogeneous resource sharing by letting servers build a hierarchy of files corresponding to real or virtual system resources, which clients then access via common (POSIX-like) file operations by sending and receiving 9P messages. The different types of 9P messages are described in Table 1.

There are several benefits to using the 9P protocol:

Simplicity: the protocol has only a handful of messages which encompass all major file operations, yet it can be implemented, including the supporting co-routine code, in around 2,000 lines of C code.

Robustness: 9P has been in use in the Plan 9 operating system for over 15 years.

Architecture independence: 9P has been ported to and used on all major computer architectures.

Scalability: our Xcpu [11] suite uses 9P to control and execute programs on thousands of nodes at the same time.

A 9P session between a server and its clients consists of requests by the clients to navigate the server's file and directory hierarchy and responses from the server to those requests. The client initiates a request by issuing a T-message; the server responds with an R-message. A 9P transaction is the combined act of transmitting a request of a particular type by the client and receiving a reply from the server. There may be more than one request outstanding; however, each request requires a response to complete a transaction. There is no limit on the number of transactions in progress for a single session.

Each 9P message contains a sequence of bytes representing the size of the message, the type, the tag (transaction id), control fields depending on the message type, and a UTF-8 encoded payload. Most T-messages contain a 32-bit unsigned integer called Fid, used by the client to identify the "current file" on the server, i.e., the last file accessed by the client. Each file in the file system served by our library has an associated element called Qid, used to uniquely identify it in the file system.

    9P type    Description
    version    identifies the version of the protocol and indicates the maximum
               message size the system is prepared to handle
    auth       exchanges auth messages to establish an authentication fid used
               by the attach message
    error      indicates that a request (T-message) failed and specifies the
               reason for the failure
    flush      aborts all outstanding requests
    attach     initiates a connection to the server
    walk       causes the server to change the current file associated with a fid
    open       opens a file
    create     creates a new file
    read       reads from a file
    write      writes to a file
    clunk      frees a fid that is no longer needed
    remove     deletes a file
    stat       retrieves information about a file
    wstat      modifies information about the file

    Table 1: Message types in the 9P protocol
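To give a feel for how these messages map onto ordinary file operations, the sketch below mounts a KvmFS server over TCP with the v9fs kernel module and reads one of its files. The mount options are those of the v9fs module, the port number follows the examples in Section 6, and the transaction names in the comments refer to the message types in Table 1:

    # Mounting establishes the 9P session: version, (optionally auth,) attach.
    mount -t 9p -o trans=tcp,port=7777 n1 /mnt/9

    # Reading a file walks to it from the root fid, opens it, reads it,
    # and clunks the fid when the file is closed.
    cat /mnt/9/arch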
5.2 KvmFS

KvmFS is implemented in C using the SPFS and Spclient [6] libraries for writing 9P2000-compliant user-space file servers and accessing them over a network. It is single-threaded code which uses standard networking via the socket() routines. Although our implementation is in C, both 9P2000 and KvmFS are language-agnostic and can be reimplemented in any other programming language that has access to networking.

OS image files used by virtual machines can grow to be quite large (sometimes up to the size of a complete system installation: several gigabytes) and can take a long time to transfer to a remote node. Starting a single VM on all the nodes of a cluster can potentially take upwards of an hour for large clusters, with nearly all of that time spent transferring the disk images of the VM either from a head node or from a networked file system such as NFS. To alleviate this problem we employ tree-based spawning of virtual machines via cloning.

During tree-spawning, once an end node has received the complete image (or in some cases a partial image), that node can retransmit the image to another node, potentially located only a hop away on the network. To allow tree-spawns, each KvmFS server can also act as a client to another server, by implementing routines which connect over 9P, create new sessions, and set up and start a new VM with the image from the local session. This reduces the number of image fetches that must be served directly by the head node and significantly increases the scale at which KvmFS can be deployed. We have tested tree-spawn algorithms for small images on several thousand nodes on LANL's clusters.

The total number of lines for KvmFS, not including the SPFS libraries, is less than two thousand lines of code. SPFS itself is 5,158 lines of code, and Spclient is another 2,381 lines of code.

6 Sample Sessions

Several examples of using KvmFS follow. The examples show KvmFS file systems mounted remotely using the v9fs [4] kernel module and then accessed via common shell commands. In the examples below, the names n1, n2, etc., are names of nodes on our cluster.

6.1 Create a virtual machine

This example creates a virtual machine using two files copied from the home directory. Disk.img is set to correspond to hard drive hda, and vmstate is used as a previously saved virtual machine state.

    mount -t 9p n1 /mnt/9
    cd /mnt/9
    tail -f clone &
    cd 0
    cp ~/disk.img fs/disk.img
    cp ~/vmstate fs/vmstate
    echo dev hda disk.img > ctl
    echo net 0 00:11:22:33:44:55 > ctl
    echo power on freeze > ctl
    echo loadvm vmstate > ctl
    echo unfreeze > ctl

6.2 Migrate a virtual machine to another node

This example shows the migration of a virtual machine from one node to another.

    mount -t 9p n1 /mnt/9/1
    mount -t 9p n2 /mnt/9/2
    tail -f /mnt/9/2/clone &
    cd /mnt/9/1/0
    echo freeze > ctl
    echo 'clone 0 n2!7777/0' > ctl
    echo power off > ctl

6.3 Create clones of a virtual machine

This example shows the cloning of a virtual machine onto several new computers.

    mount -t 9p n1 /mnt/9
    cd /mnt/9/0
    echo 'clone 2 n2!7777/0,\
    n3!7777/0,\
    n4!7777/0' > ctl
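Continuing the cloning example, the newly created sessions on the target nodes can be inspected remotely through the same file interface; this is a sketch only, with an illustrative mount point, and the session number 0 taken from the address-list above:

    # Mount one of the clone targets and verify the cloned session.
    mount -t 9p n2 /mnt/9-n2
    cat /mnt/9-n2/0/info   # device and memory configuration copied by clone
    cat /mnt/9-n2/0/id     # user-specified VM identifier
    cat /mnt/9-n2/0/ctl    # pid of the VM process, or -1 if it is not running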
7 Conclusions and Future Work

We have described the KvmFS file system, which presents an interface to virtual machines running on Linux in the form of files accessible locally or remotely. KvmFS allows us to extend the control of the partitioning and running of virtual machines on a computer beyond the system on which the virtual machines are running and onto a networked environment such as a cluster or a computational grid. KvmFS benefits large cluster environments such as the ones in use here at Los Alamos National Laboratory by enabling fine-grained control over the software running on them from a centralized location. Status information regarding the parameters of currently running VMs can also easily be obtained from computers other than the ones they are executing on.

Our system also allows checkpointing and migration of VMs to be controlled from a centralized source, thus enabling partitioning schedulers to be built on top of KvmFS.

Future work we have planned for KvmFS is in the area of fine-grained control of the execution parameters of virtual machines running under KvmFS, such as their CPU affinity. We also plan to integrate KvmFS with existing schedulers at LANL to provide a seamless way of partitioning our clusters. Another interesting issue we are exploring is exporting the resources of running virtual machines, such as their /proc file systems, through the KvmFS interface, so that processes running inside a VM can be controlled externally or even over a network.

References

[1] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, and R. Neugebauer. Xen and the art of virtualization. 2004.

[2] F. Bellard. QEMU, a fast and portable dynamic translator. USENIX 2005 Annual Technical Conference, FREENIX Track, 2005.

[3] FUSE. Filesystems in userspace. http://fuse.sourceforge.net/.

[4] Eric Van Hensbergen and Latchesar Ionkov. The v9fs project. http://v9fs.sourceforge.net.

[5] Eric Van Hensbergen and Ron Minnich. Grave robbers from outer space: Using 9P2000 under Linux. In Freenix Annual Conference, pages 83-94, 2005.

[6] L. Ionkov. Library for writing 9P2000-compliant user-space file servers. http://sourceforge.net/projects/npfs/.

[7] T.J. Killian. Processes as files. USENIX Summer 1984 Conf. Proc., 1984.

[8] AT&T Bell Laboratories. Introduction to the 9P protocol. Plan 9 Programmer's Manual, 3, 2000.

[9] R. Meushaw and D. Simard. NetTop: Commercial technology in high-assurance applications. http://www.vmware.com, 2000.

[10] R. Minnich. V9fs: A private name space system for Unix and its uses for distributed and cluster computing.

[11] R. Minnich and A. Mirtchovski. Xcpu: a new, 9P-based, process management system for clusters and grids. In Cluster 2006, 2006.

[12] R. Pike, D. Presotto, S. Dorward, B. Flandrena, K. Thompson, H. Trickey, and P. Winterbottom. Plan 9 from Bell Labs. Computing Systems, 8(3):221-254, Summer 1995.

[13] D. Presotto and P. Winterbottom. The organization of networks in Plan 9. USENIX Winter 1993 Conf. Proc., pages 43-50, 1993.

[14] Qumranet. KVM: Kernel-based virtualization driver. http://kvm.qumranet.com/kvmwiki/Documents.