The Ubiquitous File Server in Plan 9

C H Forsyth
Vita Nuova Limited
3 Innovation Close
York Science Park
York
England YO10 5ZF
forsyth@vitanuova.com

20 June 2005

1. Introduction

Plan 9 is a distributed system begun in the late 1980s by the Bell Labs research centre where Unix was born [1]. Rather than having each computer in a network act as if it still controlled everything, Plan 9 is designed to allow a single large system to be built from smaller cooperating systems performing specific tasks, notably file service, authentication, cpu service and interactive graphics. Although all system components can indeed run on a single machine, a typical development environment has a network of cheap single-user diskless terminals (usually PCs without disks these days) sharing permanent file storage and multi-processor cpu servers (often also diskless). Even the file storage implementation has two parts: a fairly conventional hierarchical file system stores data in a separate write-once block archiving service, and they can usefully run on separate machines. An authentication server runs only a special set of trusted processes to do authentication for the network, and implements only the few relevant protocols. Other cpu servers might provide resources for big computations or compilations, or provide SMTP, FTP and HTTP services to internal and external networks. A Plan 9 terminal runs programs that implement the interactive interface; it is not just a remote display as in some ‘thin client’ schemes.

Plan 9 supports a range of processor types, and it is possible to mix and match within a network: Sparc, MIPS, Alpha, PowerPC and Intel. In the past it was common to use big multi-processor MIPS machines as cpu servers with various other architectures as terminals. Currently of course most implementations use computers based on Intel or AMD processors because they are cheap and available, but ARM and PowerPC still have a place, not least in the embedded realm, and the system source is kept portable.

The system has some unusual elements, described in previous papers: its networking subsystem [2, 3], its window system [4], the Acme programming environment [5], the archival storage subsystem (Venti) [6], the archival file store above that (Fossil) [7], the earlier WORM-based file server [8, 9], the language-driven message exchange (plumber) [10], its C compiler suite [11], its language-based debugger (Acid) [12], details of its kernel implementation [13, 14], its real-time support [15], and its authentication subsystem [16]. Some aspects have been adopted by other systems: the Unicode representation UTF-8, now in common use, was originally developed for Plan 9 [17]. Plan 9’s defining novelty, however, remains its representation of resources in a distributed system [18].

__________________
Copyright © 2005 C H Forsyth. Verbatim copying and distribution of this entire article are permitted worldwide, without royalty, in any medium, provided this notice, and the copyright notice, are preserved. Libre Software Meeting, July 5-9, 2005, Dijon, France.

2. Resources as files

In Plan 9, system resources are accessed through a hierarchical name space: a tree of names, similar to a Unix file hierarchy, with directories at interior nodes and files at the leaves. Both files and directories have user/group/other access permissions and a few other attributes.
Resources represented this way include:

    •  devices (from UARTs to network interfaces)
    •  networks
    •  protocols
    •  services
    •  control and debugging of processes
    •  system management
    •  service discovery and naming
    •  graphics
    •  permanent storage structures (on disk, flash, memory, and network)

Operations on resources that would be specialised system calls or object methods in other systems are instead implemented using patterns of open-read-write-close on files and directories within the name space representing the resource. Examples are given below.

Unusually, but critically for the design, there is not a single global name space: instead a process or group of processes assembles the name space it needs for the application it implements, such as the user interface or a network gateway. The name space is not derived from a permanent copy but is computable: built and rearranged dynamically using three system calls: bind, mount and unmount. (There is a further primitive that controls sharing of name spaces between processes, discussed below.)

The bind call:

    int bind(char *name, char *old, int flag)

takes the names of two existing files or directories. It makes the thing that name refers to (file or directory) visible at the name old. By default, name hides the previous contents of old; if both are directories, however, the flag can make old into a union directory that includes the contents of name either before or after the contents of old. Thus,

    bind("/usr/forsyth/lib", "/lib", MBEFORE);

searches my private library before the system one. (The ‘union’ is actually a sequence of the names in the relevant directories, not a union of complete substructures.)

The unmount call:

    int unmount(char *name, char *old)

undoes the effects of a previous bind.

A file server in Plan 9 is any program that manufactures a name space by implementing the server side of the system’s file service protocol, 9P. Most of the significant functionality in the system, including devices, networks and graphics, is provided by file servers. The mount system call:

    int mount(int fd, int afd, char *old, int flag, char *aname);

mounts the name space generated by the file server on file descriptor fd on an existing directory old in the current name space, either adding to or replacing the existing contents of the chosen mount point (as determined by flag, just as for bind). The aname is a string interpreted by the file server; some use it to select from a set of possible trees implemented by the server. Client programs subsequently access the server using the familiar system calls on that part of the resulting name space: a kernel component (the ‘mount driver’) converts file system operations in that space to 9P messages on the file descriptor, which are then acted upon by the file server.

One user-level file server, exportfs, allows a name space to be exported on a file descriptor. Exportfs reads 9P requests from the file descriptor and executes corresponding system calls in its own name space, sending the results of the system calls in 9P replies. That one mechanism provides the basis for distribution of all resources and services.

9P was originally introduced to allow clients to share a conventional file system, much as with Sun’s Network File System [19]. As Pike and Ritchie observe [20], the design breakthrough for Plan 9 was to realise that given 9P, once resources are implemented as file systems, they can readily be exported, uniformly, building a distributed system with less fuss.
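As a concrete, if toy, illustration of these calls, the fragment below assembles a small private name space in C. It is only a hedged sketch: the directory names are arbitrary, /srv/myserver is merely a stand-in for any file descriptor already connected to a 9P server (for example a posted service or a dialled network connection), and error handling is minimal.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        int fd;

        /* union: search a private library before the system one */
        if(bind("/usr/forsyth/lib", "/lib", MBEFORE) < 0)
            sysfatal("bind: %r");

        /* splice a 9P file server, already open on a descriptor, into the name space;
         * afd of -1 means no authentication fd, "" is the aname passed to the server */
        fd = open("/srv/myserver", ORDWR);      /* hypothetical connection */
        if(fd < 0 || mount(fd, -1, "/n/myserver", MREPL, "") < 0)
            sysfatal("mount: %r");

        /* undo the union again */
        unmount("/usr/forsyth/lib", "/lib");
        exits(nil);
    }

Because these calls affect only the name space of the calling process (and whatever processes share it), experiments of this kind disturb nobody else on the machine.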
3. Designer’s names

Although Plan 9 has the usual collection of Unix-like commands and filters, it offers the new possibility of implementing system and application functions as file servers. Indeed, even the kernel representation for devices, services, network interfaces, and protocols is the same: they are all kernel-resident file servers. Thus the design of a given component, whether device driver, system service or application, often begins by designing a suitable name space, at a level of abstraction above that of (say) the API for any particular programming language. In other words, in Plan 9, the name space provides the focus for design. We look at a reasonable collection of examples, then describe the underlying protocol that links everything together.

Example: network interfaces

Plan 9 does not provide special ‘socket’ system calls to access networks. Instead, devices, interfaces, protocols and services are all represented by file trees, conventionally collected together under /net. Figure 1 shows a subset of the /net directory on my Thinkpad terminal.

    /net/
        arp
        ether0/
            addr
            clone
            ifstats
            stats
            0/
                ctl
                data
                ifstats
                stats
                type
            ...
        ether1/
            ...
        ipifc/
            clone
            stats
            0/
                ctl
                data
                err
                listen
                local
                remote
                snoop
                status
            ...
        iproute
        tcp/
            clone
            stats
            0/
                ctl
                data
                err
                listen
                local
                remote
                status
            1/
                ...
            ...
        ...

    Figure 1. Subset of /net on a terminal with several Internet network interfaces.

Most of the entries are from an instance of the file server that represents an Internet protocol stack (not all its protocols are shown in the figure), but ether0 represents the wired Ethernet connection, and ether1 represents my wireless connection, both provided by separate instances of the file server type that represents Ethernet devices.

The ether directories show a common Plan 9 naming convention. A file server that multiplexes many connections represents each connection as a numbered directory, each with a ctl file to control that connection, and data to access that connection’s data or messages. The ctl file when read contains the number of the directory in decimal (ie, as text). Finally, opening the clone file is equivalent to opening the ctl file of an unused or newly-allocated connection directory. Thus, to get a new connection, open the clone file, and read it to find the directory number and access the other files. Textual messages on the ctl file do the work of special system calls in other systems. The file ownership and permissions prevent undesired access to the connections of other users on multi-user systems.

The ether connections correspond to different ethernet packet types; the type for a given connection is set by writing connect n to its ctl file. Packets of that type can then be read and written as binary messages in standard Ether format on the data file. The file /net/ether0/addr contains the MAC address of the interface.

The ipifc directory configures network interfaces into the Internet protocol subsystem. A new interface is allocated by opening the clone file, and writing a control message to bind a given medium’s interface into the IP stack, in a way appropriate to the medium. For instance:

    echo bind ether /net/ether0 >/net/ipifc/clone

will allocate a new connection in ipifc, and use /net/ether0 as a device that adheres to the rules for an ether medium. Other media include pkt, which allows a user program to read and write the connection’s data file to send and receive IP packets, and netdev, for a device that consumes and produces IP packets. The pkt interface can be used for PPP and NAT implementations. Subsequently, other control messages can be written to set Internet addresses and masks for the interface:

    echo add 144.32.112.70 255.255.254.0 0 >/net/ipifc/0/ctl

Other properties can also be set by control messages. The current routing table for the interface is presented as text on iproute, which also accepts textual messages to change it.

Individual protocols, as configured, have their own subdirectories in the IP stack’s directory, thus tcp, udp etc. Outgoing connections are made by writing connect address!port to the ctl file. The write gives a diagnostic if the connection cannot be made; otherwise, read and write the data file to exchange data on the connection. A process listens for incoming calls by first writing announce port to ctl, and then opening the listen file. The open blocks until an incoming call is made. It then returns a file descriptor open on the ctl file for a new directory for that connection. For an active connection, the local and remote files contain text representing its endpoint addresses.

Only one instance of the IP stack server appears above, and that is the usual case, but there can be more than one, each serving a complete and self-contained IP stack. Firewalls and gateways often use more than one stack, each serving different sets of physical interfaces, such as an ‘inside’ and ‘outside’ Ether, with different sets of services offered on each. Packets cannot ordinarily move from one interface to another, but a filtering firewall can mount both interfaces in its name space, give itself a pkt interface on each stack, and copy packets between them, filtering and adjusting addresses as required.

Example: domain name service

The Domain Name System implementation on Plan 9 is a file server, ndb/dns. It serves one file, which conventionally appears at /net/dns. An application that wishes to translate between a name and resource opens that file, writes the name and resource type (as a single line of text), and reads successive possible translations. If the name cannot be translated, the write returns an error. A small utility program ndb/dnsquery allows us to work some examples:

    term% ndb/dnsquery
    > www.vitanuova.com
    www.vitanuova.com ip   193.195.70.8
    > vitanuova.com mx
    vitanuova.com mx   10 smtp.vitanuova.com
    > www.cnn.com
    cnn.com ip   64.236.16.52
    cnn.com ip   64.236.16.116
    cnn.com ip   64.236.24.20
    cnn.com ip   64.236.24.12
    cnn.com ip   64.236.24.4
    cnn.com ip   64.236.16.20
    cnn.com ip   64.236.24.28
    cnn.com ip   64.236.16.84
    > 64.236.16.52
    52.16.236.64.in-addr.arpa ptr   www4.cnn.com
    > www.wotsmyname.com
    !dns: name does not exist

One advantage of this implementation is that the cache of DNS translations is quite naturally shared by all applications, without special handling. As it happens, the DNS service is most often not used directly by applications, but by means of the more general connection service, described next.

Using a file server to hold shared state, as dns does, is a common pattern. For example, the mail ratification service ratfs provides a persistent representation of the system’s spam blocking list, serving a name space that allows the list to be queried by many concurrent instances of smtpd, and updated by simple shell scripts using echo, or more elaborate spam filters, simultaneously.
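To make the clone-file convention concrete, the following hedged C sketch makes an outgoing TCP connection by hand, directly against the /net/tcp hierarchy of Figure 1. In practice a program would simply call the library routine dial, which performs these steps (via the connection service described in the next example) with proper error handling; the function name tcpconnect and the buffer sizes here are just illustrative choices.

    #include <u.h>
    #include <libc.h>

    /* connect to addr of the form "193.195.70.8!80"; returns a data fd or -1 */
    int
    tcpconnect(char *addr)
    {
        char buf[32], data[64];
        int cfd, dfd, n;

        cfd = open("/net/tcp/clone", ORDWR);    /* allocates a fresh connection directory */
        if(cfd < 0)
            return -1;
        n = read(cfd, buf, sizeof buf - 1);     /* its number, as decimal text */
        if(n <= 0 || fprint(cfd, "connect %s", addr) < 0){
            close(cfd);
            return -1;
        }
        buf[n] = 0;
        snprint(data, sizeof data, "/net/tcp/%d/data", atoi(buf));
        dfd = open(data, ORDWR);    /* bytes read and written here flow on the connection */
        /* a real program would also look after cfd for the connection's lifetime */
        return dfd;
    }

The same clone, ctl and data pattern applies unchanged to the other protocol directories under /net.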
As another example, a Plan 9 Usenet news server maintains the history of article IDs as an in-memory database, and serves files such as msgid and post that allow efficient query and update by many clients that are filing different incoming news streams.

Example: connection service

The connection service ndb/cs translates a symbolic network address into a set of recipes for connecting to that service. Network addresses in Plan 9 are text with a standard form:

    [ network ! ] netaddr [ ! service ]

Network identifies a particular network type or protocol, such as il (a reliable datagram protocol), tcp, udp, telco, ether0, and others. In general, it is the name of a directory under /net, so the set is extensible. The network name net is the default, and means ‘any supported network’. Service is often a symbolic name, even when the protocol itself uses port numbers. This syntax works for IP, datakit, x.25, atm, IrDA, ether, telephones, and others.

The connection service uses its knowledge of the available networks and underlying naming systems (such as DNS and Plan 9’s own network database) to work out possible ways of converting the components of the symbolic address to physical network addresses. It serves a single file, conventionally /net/cs. A client writes to the file the symbolic name of the service desired; each subsequent read returns a set of instructions for one possible way of making the connection, expressed as operations on other files in the name space. For example, to translate the symbolic address net!doppio.bloggs.com!9fs a client process opens /net/cs, writes that string to it, and if the write succeeds, reads back a list of recipes, as text, one for each read:

    /net/il/clone 200.1.1.67!17008
    /net/tcp/clone 200.1.1.67!564

Each is interpreted as ‘open the file named by the first field and write the value of the second field into it’. The effect is to make a network connection to the desired service. There can be several such recipes (one per read) if the destination address has more than one translation (eg, a host with several network interfaces). In the case above, the IL/IP protocol can be used as well as TCP/IP; that will be attempted first. Note that the server can distinguish opens by different clients, and thus a client sees only the recipes for its own requests. If a name cannot be translated, the server returns an error to the client on the initial write. When an Internet domain name appears in an address, cs uses /net/dns described above to translate it to one or more numeric Internet addresses.

Example: mail boxes

Access to mail boxes is through a name space presented by the program upas/fs. Each mail box is represented by a directory in the server’s name space, conventionally mounted on /mail/fs. It serves up mail boxes in the structure shown below:

    /mail/fs/
        ctl
        mbox/
            1/
                bcc
                body
                cc
                date
                digest
                disposition
                filename
                from
                header
                ...
                raw
                rawbody
                rawheader
                ...
            2/
                bcc
                body
                cc
                ...
                1/
                    bcc
                    body
                    ...
                2/
                    bcc
                    body
                    ...

Each message appears as a directory. The files in each message directory contain the parsed elements of the body and header of the corresponding message. For instance:

    % cat /mail/fs/mbox/782/subject
    Re: [iPAQ] A working 802.11 / AP combination?

Access to the unparsed bytes of body, header or the whole message is provided through the several raw files. The ctl file in the root directory accepts requests to delete one or more mail messages, or the message directory can simply be deleted by the remove system call (or rm command).
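Since the mail box is just a name space, a toy mail lister needs nothing beyond the ordinary file operations. The hedged sketch below prints the directory name, sender and subject of each message, assuming upas/fs is already mounted at the conventional /mail/fs; the file names follow the structure described above, and error handling is kept minimal.

    #include <u.h>
    #include <libc.h>

    /* read a small file into buf as a NUL-terminated string ("" on error) */
    static void
    readfile(char *path, char *buf, int nbuf)
    {
        int fd, n;

        buf[0] = 0;
        fd = open(path, OREAD);
        if(fd < 0)
            return;
        n = read(fd, buf, nbuf-1);
        if(n > 0)
            buf[n] = 0;
        close(fd);
    }

    void
    main(void)
    {
        int fd, i, n;
        Dir *d;
        char path[128], from[128], subj[128];

        fd = open("/mail/fs/mbox", OREAD);
        if(fd < 0)
            sysfatal("open /mail/fs/mbox: %r");
        while((n = dirread(fd, &d)) > 0){
            for(i = 0; i < n; i++){
                if((d[i].qid.type & QTDIR) == 0)
                    continue;    /* each message is a directory */
                snprint(path, sizeof path, "/mail/fs/mbox/%s/from", d[i].name);
                readfile(path, from, sizeof from);
                snprint(path, sizeof path, "/mail/fs/mbox/%s/subject", d[i].name);
                readfile(path, subj, sizeof subj);
                print("%s\t%s\t%s\n", d[i].name, from, subj);
            }
            free(d);
        }
        exits(nil);
    }

Deleting a message is equally direct: remove the message directory, or write a request to the top-level ctl file, as described above.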
Attachments are represented by subdirectories with the same structure, and so on recursively. Mail reading applications are easy to write, because the correct parsing of the RFC822 mail header, interpretation of the MIME structure, and character set conversion, none of which is trivial, are done centrally by upas/fs. For instance, there is a little library of shell functions that supports virus and spam detection, using grep -s and other commands, including specialised ones such as upas/bayes, applied to files in the mailbox name space. Furthermore, several mail readers can access the same mail box simultaneously with consistent results as new messages arrive and others are deleted.

Note that although upas/fs presents a file-system-like view of its internal mail box representation, that representation is not actually a collection of files and directories on disc, and the contents of each part of the hierarchy are provided only in response to a client’s request. For instance, when a client reads the subject file, a message is sent on its behalf to upas/fs, which sends a reply message containing a value copied directly from an internal string value that holds the parsed subject line. Directory entries are similarly generated on demand.

Example: authentication

Two file servers support authentication. One runs only on the authentication server, in a secured environment. The other runs on other Plan 9 nodes and acts as the authentication agent for users and servers, holding both sets of keys and all knowledge of authentication protocols.

Keyfs runs on the authentication server(s) for a Plan 9 authentication domain. The current Plan 9 authentication protocol is based on secret keys (eg, pass phrases). The secrets are stored, labelled by user, in a record-oriented file, encrypted by a master key. (In fact, a hash of the secret is stored, not the original value.) Keyfs decrypts the file and serves the following name space:

    username/
        key
        log
        status
        expire
    ...

In the two-level hierarchy, the top level directories are named after the registered users (eg, forsyth, rog, etc). The authentication data for each user is found in the corresponding directory. The file key contains the authentication key for that user (eg, a ‘shared secret’). If the account is disabled, a read request returns an error. A key can be changed by writing to the file. The account’s expiration date can be read or written in the file expire.

The keyfs hierarchy is mounted on /mnt/keys in a name space created to run only authentication services. Those services can access the authentication data for users, but do not know the master key. Furthermore, several services can access the data simultaneously without risking its corruption, because keyfs acts as the master file’s monitor. Applications in all other name spaces see an empty directory at /mnt/keys (because keyfs is not mounted in those name spaces).

Factotum is a general authentication agent, with at least one instance per user. It is the only application in the system that knows the details of a dozen or so authentication protocols, including Plan 9’s own, and others such as apop or ssh. It serves the following two-level name space, typically union-mounted on /mnt:

    factotum/
        rpc
        proto
        confirm
        needkey
        log
        ctl

The proto file lists the authentication protocols implemented by this instance of factotum.
The ctl file is written to install and remove keys:

    cd /mnt/factotum
    # add a key:
    echo key 'user=badwolf' 'service=ftp' 'server=unit.org.uk' \
        '!password=buffalo' >ctl
    # remove all matching keys:
    echo delkey 'server=unit.org.uk' >ctl

When an application wishes to authenticate to another service, it opens the rpc file and follows a simple protocol (implemented by a C library function). The client writes to rpc to tell factotum the desired authentication protocol, user, service and so on, and then it acts as a proxy, forwarding messages between the authenticating service and factotum, until authentication is complete. Only factotum knows the protocol details: the application simply follows factotum’s instructions on the phasing of reads and writes, and the data contents. Furthermore, for secure protocols, the application cannot see the user’s keys.

Of course, not all keys will be preloaded in general. It is sometimes necessary to prompt for them. The needkey file can be opened by another program, typically auth/fgui, and read to receive requests for keys. The read blocks until factotum needs a key it does not yet hold; factotum then replies to the read with a description of the desired key, and fgui prompts the user in any way desired and writes the key (if the user supplies one) to /mnt/factotum/ctl as above.

Example: window management

Graphics is ultimately provided through the draw device, a kernel file server that serves a three-level hierarchy:

    draw/
        new
        0/
            ctl
            data
            colormap
            refresh
        1/
            ...
        ...

Each connection to the device is represented by a numbered directory, following a similar scheme to the network devices. A bitmap graphics application opens /dev/draw/new to allocate itself a new connection, reads the connection number, and opens the other files as needed. The data file implements the graphics protocol: messages are read and written by the client to allocate and free images and subfonts, draw on images (on and off screen), and so on.

The draw device uses a variant of Pike’s layers [21] to allow many applications to write to overlapping windows on the screen without confusion, thus multiplexing graphics output, but it does not provide window management or multiplexing of mouse and keyboard input. That is done by a separate window manager, rio, which is also a file server. It works in a similar way to the older 8½ [4], but more efficiently because the separate draw device multiplexes graphics output directly. Each window is given its own name space, in which rio serves the names cons and mouse (amongst many others). When a textual application such as the shell opens /dev/cons, it will open the one for that window, served by rio. When a graphical application opens /dev/mouse, it will open the one for that window. When it starts up, rio opens /dev/cons and /dev/mouse in its own name space, and follows rules such as ‘button 1 selects a window’ to decide which client window should receive the data it reads from those files, delivering it on that window’s /dev/cons and /dev/mouse. This recursive structure allows an instance of rio to run inside a rio window.

Acme [5] is an unusual application that centralises user interaction for text-oriented applications. Its window on the screen is tiled with many text windows, labelled with a file or directory name. Each of the three mouse buttons selects text, but does something different with it.
Briefly, button 1 selects text for editing, button 2 executes it as a command (perhaps built in to Acme), and button 3 takes the selected text as something to locate or retrieve (eg, a file or directory, or text in a file). Acme serves a single name space containing a few top-level files and directories, and a numbered directory for each such window:

    /mnt/acme/
        acme/
        cons
        consctl
        draw
        editout
        index
        label
        new
        1/
            addr
            body
            ctl
            data
            editout
            errors
            event
            rdsel
            wrsel
            tag
            xdata
        2/
            ...
        ...

Its name space is quite different from rio’s, even though both are window managers, because the level of interaction is different. Acme supports only text windows, and its clients interact with the user through high-level operations on Acme files corresponding to the client’s text windows. For instance, a regular expression can be evaluated on the text in a given window by writing it as text to the corresponding addr file; a read returns a textual description of the start and end characters of the next match. Reading the data file retrieves the matching text; writing to data replaces it with the written text.

The event file, however, is the key one for most clients. Changes to the text in the window (or its label), for instance by the user’s editing actions, are reported as messages on the event file. Button 1 editing operations are handled as normal by acme and simply reported. Operations by button 2 (execute) and button 3 (search/retrieve) produce messages giving the selected text and other parameters, leaving the action up to the application to implement. For example, the mail reader produces a list of mail messages in one window, and when one of those is selected by button 3, it receives a message on event, retrieves the selected mail message from upas/fs, and puts its text in a new Acme window. That window has some mail-specific commands added to its label: Reply, Delmesg and Save. When the user clicks one of those with button 2, the client receives a suitable message on event and can take appropriate action, such as popping up a new text window to receive the reply text. (A client can also write back an event message to acme to have it take its default action.) Acme’s clients include general text editing, debugging, mail reading, Usenet news, and others, all operating through the files in the name space above.

Example: storage formats

File servers are also used in a more conventional way to make the contents of various archive and file formats (eg, tar, zip, ISO9660, DOS) available through the name space. Less conventionally, the interface to the FTP protocol is also a file server, ftpfs, that makes the remote FTP archive visible in the local name space directly as files and directories, and subject to existing shell commands and system calls, rather than accessible only through a special command language.

The primary permanent (ie, conventional, disc-based) file storage is provided through a file server, fossil [7]. It is an ordinary application program, not built in to the kernel. It has two unusual properties: it uses an underlying specialised block-oriented archive service, venti [6], to store its data and, taking advantage of that and of an internal copy-on-write storage structure, it efficiently maintains snapshots of its whole file hierarchy, made automatically at nominated times, with each snapshot directly accessible through the name space.
This provides automated backup and a simple way to see changes over time:

    term% diff /n/dump/2005/0201/sys/src/cmd/p.c /sys/src/cmd
    36c36
    <         fprint(2, "p: can't open %s\n", argv[0]);
    ---
    >         fprint(2, "p: can't open %s - %r\n", argv[0]);

As well as using diff and grep, and of course cp to fetch things from the backup, it is possible to bind whole sections of the past into the present name space, for instance to do regression checks on the compilers and libraries, or to compare present and past performance. The underlying Venti archives can be copied to guard against disaster.

3.1. Discussion

File servers can act as the interface to all manner of resources, as suggested briefly by the examples above. In Plan 9, file servers can rely on the following properties:

    •  the server sees all operations by each client in its space
    •  clients can be given per-client views of the underlying service or data
    •  many clients can safely share the same data through the file server
    •  a single server can provide access to many independent data files for many clients
    •  complex parsing and concurrency control can be factored out of applications into the server, making both server and application easier to write
    •  as upas/fs and keyfs show, the actual format of underlying data files is hidden from applications
    •  files have ownership and permission, enforced as desired by the file server itself
    •  servers can support indexing, queries, replication and transaction control

Furthermore, the services presented through the name space can be securely exported to the network using standard mechanisms in the operating system, transparently to both application and server. The protocol itself is straightforward (except for flush), and there is a 9P library that helps make file servers easier to write. Most of the work in upas/fs is handling RFC822 messages and MIME, not presenting the file system view.

The exposing of data interfaces through the name space has a further advantage. Since the hierarchy is in the normal name space, ordinary system commands can operate on it, including ls, cmp, and cp. This is helpful during implementation and debugging. At Vita Nuova, when we wish to develop client and server sides concurrently, we simply agree a common name space and suitable messages on the files therein. The server is often tested using shell commands or shell scripts; the client is tested by using a mixture of static files and a rapid prototype of the server’s name space (eg, allowing its interaction to be scripted).

The biggest benefit, of course, was mentioned earlier: all the resources described thus far, all the devices and interfaces provided as file servers by the kernel, and other file serving applications not mentioned, can have their resources distributed on the network, using an appropriate combination of mount and exportfs. The import command parcels up one such instance:

    import host remotefs [ mountpt ]

It dials exportfs running as a service on host, does mutual authentication, requests the service to apply exportfs to the name remotefs in the service’s name space, then mounts the resulting network connection on mountpt locally. Thus, to use another machine’s network interfaces:

    import -b mygateway /net

The -b option puts the remote’s /net in a local union mount, before the local machine’s own interfaces. Thus, a request to use a given protocol will use the remote’s if it has it, and the local one otherwise. Similarly, one can import services from other machines (eg, by importing /net/cs or /net/dns).
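The essence of import can be expressed in a few lines of C. The sketch below is heavily hedged: it ignores the mutual authentication and the initial negotiation in which import names the remote subtree to be exported, and the service name exportfs in the dial string is only a plausible convention, not something to rely on. The point is simply that a network connection carrying 9P can be mounted and bound like any other file descriptor.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        int fd;

        /* dial a remote exportfs service; "tcp" and "exportfs" are assumptions here */
        fd = dial(netmkaddr("mygateway", "tcp", "exportfs"), 0, 0, 0);
        if(fd < 0)
            sysfatal("dial: %r");
        /* the remote now speaks 9P on fd: splice its tree into our name space */
        if(mount(fd, -1, "/n/remote", MREPL, "") < 0)
            sysfatal("mount: %r");
        /* union the remote network interfaces in front of our own, as import -b does */
        if(bind("/n/remote/net", "/net", MBEFORE) < 0)
            sysfatal("bind: %r");
        exits(nil);
    }

After the bind, a request to use a protocol the gateway provides goes out through the gateway’s interfaces, which is exactly the effect described above for import -b.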
Remote graphics is similar. Plan 9’s cpu command is typically used to connect a window on a terminal to a cpu server. It dials a service on the server, authenticates, then uses exportfs to export the terminal’s name space to the cpu server, which mounts it on /mnt/term. It then binds /mnt/term/dev over /dev so that the terminal’s devices will be used, not the cpu server’s own, and starts a shell. The shell opens /dev/cons, but that’s bound from /mnt/term/dev/cons, so input and output appear in the terminal window. More important, running a graphics program will open a connection to /dev/draw and /dev/mouse, which will go over the mounted connection via exportfs into the terminal’s exported name space, thus doing the graphics in the terminal window. In particular, running rio in the window will start a window system within that window, but the windows created there will have their programs running on the cpu server. Obviously the same extends to remote audio, other remote devices, factotum, mail boxes, grid scheduling, etc.

4. The protocol

The name space is implemented using a single protocol, 9P. It is defined by a few pages in section 5 of the Plan 9 Programmer’s Manual, Volume 1 [22], but here is a brief summary. The protocol provides 13 operations, shown in Table 1.

    Tversion tag msize version                  start a new session
    Rversion tag msize version

    Tauth tag afid uname aname                  optionally authenticate subsequent attaches
    Rauth tag aqid

    Tattach tag fid afid uname aname            attach to the root of a file tree
    Rattach tag qid

    Twalk tag fid newfid nwname nwname*wname    walk up or down in the file tree
    Rwalk tag nwqid nwqid*wqid

    Topen tag fid mode                          open a file (directory), checking permissions
    Ropen tag qid iounit

    Tcreate tag fid name perm mode              create a new file
    Rcreate tag qid iounit

    Tread tag fid offset count                  read data from an open file
    Rread tag count data

    Twrite tag fid offset count data            write data to an open file
    Rwrite tag count

    Tclunk tag fid                              discard a file tree reference (ie, close)
    Rclunk tag

    Tremove tag fid                             remove a file
    Rremove tag

    Tstat tag fid                               retrieve a file’s attributes
    Rstat tag stat

    Twstat tag fid stat                         set a file’s attributes
    Rwstat tag

    Tflush tag oldtag                           flush pending requests (eg, on interrupt)
    Rflush tag

    Rerror tag ename                            error reply with diagnostic

    Table 1. 9P messages

T-messages are the requests sent from client to server, with a tag unique amongst all outstanding messages. The R-messages are sent in reply, with the tag from the original T-message (that tag value can then be reused). A fid is a number chosen by the client to identify an active file (directory) on the server. The association is made by Tauth, Tattach, or Twalk, and lasts until it is cancelled by Tclunk or Tremove. The fid in Tattach is associated with the server’s root; the newfid in Twalk is associated with the result of walking from directory fid using each wname in turn. The other requests use a fid to identify the file or directory to which they apply. The server can respond to requests out of order, and might defer a reply until some other event occurs (eg, a read of /dev/cons waits until the user types something). The client can cancel any outstanding request using Tflush, giving the tag of the cancelled request.
The protocol is defined to handle correctly cases such as a flush arriving after the reply, or a tag being flushed more than once. A server’s files are uniquely identified by a 13-byte qid, which it determines (it has some substructure and attributes not discussed here). Two files are identical if they have the same qid. (Note that unlike Unix i-node numbers, one cannot access a file using only a qid.) The qid can be used to detect stale data when caching data or files.

Outside the kernel, over pipes and on the network, the protocol has a well-defined and compact representation as a stream of bytes. The protocol is not an Internet protocol, but does need reliable, in-sequence delivery. Each message is preceded by a four-byte message length to delimit it in the stream, the message type is one byte, tags are two bytes, fids are four bytes, counts are four bytes, qids are 13 bytes, and strings and stat data are preceded by a two-byte count. The code to marshal and unmarshal the structures is simple, obvious and small.

An incoming connection is optionally authenticated, with encryption and digesting engaged if desired. That operation lies outside 9P proper. Within the protocol, the server can require that each Tattach be authenticated. If so, the client does a Tauth which creates an authentication file on the server, referred to by afid, which can then be read and written using Tread and Twrite to exchange the data in any agreed authentication protocol. If successful, the afid can then be presented in a subsequent Tattach to authorise it. Both connection-level and Tauth authentication are typically done using factotum.

5. Implementation environment

We saw above that Plan 9 has ‘file servers’ at the heart of its design and implementation. They rely, however, on aspects of the system that are available to all applications. The remaining sections look briefly at the programming environment, the support for concurrency, and the Plan 9 kernel.

Apart from a few venerable utilities derived from Research Unix, all the code is new and unrelated to the commercial Unix code; even the older programs have usually been changed, for instance to support Unicode, Plan 9’s native character set. In particular, the Plan 9 kernel code, including the networking code, is completely new. It was designed with symmetric multiprocessor support from the start. Even so, the kernels have seen several significant revisions, mainly in the networking code. For example, the first two editions provided a Stream I/O subsystem similar to that of 8th Edition Unix (and much simpler than System V’s), which was used in various device drivers, including the protocol stack. The Third Edition replaced that by a simpler, more efficient abstraction for queued I/O. The networking subsystem also became more modular.

The system is written to be portable. Assumptions about architectural details, such as the structure of the memory-management unit, are confined to a small amount of machine-dependent code. Indeed, unlike some ‘portable’ systems, the aim is to map the software requirements as efficiently as possible into the real processor, not to reflect the processor’s nature into the higher-level software. It is also trivial to compile for any architecture on any other, and to debug one platform from any other. The kernel and nearly all applications are implemented in Plan 9’s dialect of C. It is not POSIX compatible, because the libraries and programming conventions were done anew for Plan 9.
For instance, the structure of libraries and header files is much more straightforward. There are new libraries: libbio for buffered IO; lib9p for file server implementation; libsec and libmp for encryption; libauth for authentication; libplumb to interact with the message exchange; libdraw for bitmap graphics; libthread for concurrent programming (see below); and others.

Plan 9 uses its own C compiler suite, written by Ken Thompson, which itself has unusual structure. See Thompson’s paper [11] for details. There is good support for development for a system with heterogeneous architectures (both compilation and debugging). In general, the source code to an application is invariant across architectures. There is no use of #ifdef or ./configure to parametrise the source code for any particular target architecture. Application code (and most kernel code) is simply written not to depend on architectural details. Programs, such as the debuggers, that need to manipulate executables and other machine-dependent data do so through the library libmach. Remote debugging is done by importing the /proc file system from the remote machine into the debugger’s name space. The two machines can of course be of different architectures.

A bigger difference is the good support at both library and kernel level for concurrent programming. That support is essential and well used: many applications are concurrent programs. File servers are often concurrent programs to allow them to handle many clients at once, particularly when some requests might block. For example, ndb/cs creates a new process for each request that requires use of DNS or similar services. DNS also creates a process for each request that cannot be answered immediately from its cache. The window system rio, the text window system acme, exportfs and the plumber are all concurrent programs.

Two system calls, rfork and rendezvous, provide all the necessary kernel support. Rfork takes a bit mask that controls the sharing (or not) of resources of the calling process with its parent and children:

    pid = rfork(RFPROC|RFMEM|RFFDG);

Resources include file descriptors, name space, environment variables, rendezvous group (see below), and memory. Rfork can optionally create a new child process. The parent also chooses which of its other resources will be shared with that child. In particular, a child can optionally share the parent’s memory segments. It always has a private copy of the stack to that point, avoiding the need for assembly-language linkage, and also providing some process-private memory.

Rendezvous allows two processes that share the same rendezvous group to synchronise:

    void* rendezvous(ulong tag, void *value);

One process calls rendezvous with an agreed tag value; it blocks until another process does the same. At that point, the values are exchanged between the processes, and each call returns the other process’s value. The meaning of the value is up to the programmer, although it is often a pointer to a shared data structure. All other concurrent programming operations are built on rendezvous.

Two styles of concurrent programming have library support. The Plan 9 C library provides spin locks, queued locks, and reader/writer locks, for primitive shared-memory programming.
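A minimal hedged sketch of the two primitives, using the signatures given above: the parent and child share memory, meet at an agreed tag, and swap a pair of strings. The tag value is arbitrary, and real programs rarely call rendezvous directly; the locks and channels mentioned here are built on top of it.

    #include <u.h>
    #include <libc.h>

    enum { Tag = 1234 };    /* any value agreed between the two processes */

    void
    main(void)
    {
        void *v;

        switch(rfork(RFPROC|RFMEM|RFFDG)){
        case -1:
            sysfatal("rfork: %r");
        case 0:
            /* child: shares the data segments, has its own stack;
             * block here until the parent arrives at the same tag */
            v = rendezvous(Tag, "hello from the child");
            print("child received: %s\n", (char*)v);
            exits(nil);
        default:
            /* parent: the two calls meet and swap their values */
            v = rendezvous(Tag, "hello from the parent");
            print("parent received: %s\n", (char*)v);
        }
        exits(nil);
    }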
Most concurrent applications, however, use a newer library that allows processes in a shared address space (as produced by rfork) to communicate and synchronise by sending and receiving values on typed channels, in the style of Hoare’s Communicating Sequential Processes (as subsequently implemented by occam and a few other languages). The basic operations are send on a channel, receive from a channel, and alt to send or receive on one of a number of channels (the first that is ready, allowing a process to interact with several senders on different channels). The library provides both synchronous (unbuffered) and buffered channels. Plan 9 itself has only one kind of process (the aim is to keep that sufficiently lightweight). The library adds support for threads, but they are coroutines, not processes, and scheduled non-preemptively (within a given Plan 9 process, which can itself be preempted). Threads and processes can however interact through channels. The debugger acid has a library in the acid language that supports debugging concurrent applications. Most new non-trivial concurrent programs use the channel-based library.

6. Kernel implementation

The kernel has an invariant core that includes the implementation of kernel and user processes, memory allocation, virtual memory, queued I/O support, and name space primitives, in about 36 source files (including some optional modules). A further 27 files provide platform-independent kernel file servers, including those for essential functionality.

The kernel supports symmetric multiprocessing, using kernel processes (which can be preempted) cooperating through spin locks, basic queued locks, queued locks providing multiple readers and exclusive writing, and multiprocessor sleep and wakeup. Processes are assigned dynamically to one of the available processors, taking account of their previous affinity, to reduce cache and TLB flushing on some hardware. Processes are preemptable, and scheduled by priority. There are 20 priority levels, split into bands (user, kernel and root kernel processes), with dynamic priority adjustment in the user process band by default, though that can be adjusted through a control file. Real-time support is provided by a variant of EDF (Earliest Deadline First), using deadline inheritance to handle shared resources [15]. It is enabled and controlled for a process or process group using files in /proc.

Device driver routines are invoked in a process context (either a kernel process or the kernel context of a user-mode process in a system call). Thus a single device driver serves numerous client processes. Interrupts are handled outside process context (either on the current process stack or on a special stack). It is up to the machine-dependent code how and when rescheduling takes place on an interrupt. Within the kernel, the 9P protocol is implemented directly by function calls; that constitutes the only interface between the kernel and its devices and protocols.
The functions in the current interface are shown below:

    struct Dev {
        int     dc;
        char*   name;

        void    (*reset)(void);
        void    (*init)(void);
        void    (*shutdown)(void);
        Chan*   (*attach)(char*);
        Walkqid*(*walk)(Chan*, Chan*, char**, int);
        int     (*stat)(Chan*, uchar*, int);
        Chan*   (*open)(Chan*, int);
        void    (*create)(Chan*, char*, int, ulong);
        void    (*close)(Chan*);
        long    (*read)(Chan*, void*, long, vlong);
        Block*  (*bread)(Chan*, long, ulong);
        long    (*write)(Chan*, void*, long, vlong);
        long    (*bwrite)(Chan*, Block*, ulong);
        void    (*remove)(Chan*);
        int     (*wstat)(Chan*, uchar*, int);
        void    (*power)(int);    /* power mgt: power(1) => on, power(0) => off */
        int     (*config)(int, char*, DevConf*);
    };

A Chan corresponds to a fid in the protocol. The separate structure Walkqid represents a variable-length array of Qid values. There are a few extra functions to allow the file server to initialise its own state and that of a device it is driving; for power management and dynamic configuration; and bread and bwrite to read and write data directly in the Block representation used by the network subsystem and other device drivers. A kernel library helps implement name spaces, parse control messages, and manage queues of data.

Some kernel file systems have their own interfaces below them. For example, a single devether.c serves the standard name space expected of the ether medium, and implements its control messages. Each physical implementation has its own device-specific code, which the file server invokes through the following interface:

    typedef struct Ether Ether;
    struct Ether {
        ...
        void    (*attach)(Ether*);        /* filled in by reset routine */
        void    (*detach)(Ether*);
        void    (*transmit)(Ether*);
        void    (*interrupt)(Ureg*, void*);
        long    (*ifstat)(Ether*, void*, long, ulong);
        long    (*ctl)(Ether*, void*, long);    /* custom ctl messages */
        void    (*power)(Ether*, int);          /* power on/off */
        void    (*shutdown)(Ether*);            /* shutdown hardware before reboot */
        void    *ctlr;
        Queue*  oq;
        ...
    };

It separates the card-specific code from the file-serving code. Devether.c itself uses a library of functions that helps implement the bulk of a generic ‘network interface’ name space, leaving it to deal with the Ether-specific work. A similar approach is taken with some other device drivers, such as those for UARTs and USB.

7. The distribution

The Plan 9 distribution includes all the current compiler suites, and the source code for the kernel and applications. It is self-supporting and does not require any other software to run, compile or develop it. The Plan 9 compiler suite [11] currently supports 680x0, various MIPS, ARM/Thumb, Alpha, x86, AMD64, SPARC, and PowerPC. (Other implementations have existed but are now obsolete.) The free distribution includes the following kernels (in /sys/src/9):

    alphapc     Alpha PC 164
    bitsy       iPAQ 36xx
    mtx         PowerPC 603
    pc          Intel/AMD/Other x86
    ppc         PowerPC 8260 or 750/755

The kernels other than pc are included as samples of the system running on non-Intel architectures; they usually target specific hardware.

8. Resources

    ... but as he climbed in over the rail, so a waft of air took the frigate’s sails back, a breath of heavy air off the land, with a thousand unknown scents, the green smell of damp vegetation, palms, close-packed humanity, another world.
        Patrick O’Brian, HMS Surprise

The whole system can be downloaded free of charge in PC-bootable ISO9660 format; see plan9.bell-labs.com/plan9. The site also offers online updates to Plan 9 systems once installed, using an authenticated 9P connection.
The documents and manual pages are included in machine-readable form; they are also available in printed form, ordered from www.vitanuova.com/plan9.

References

1. Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey, and Phil Winterbottom, ‘‘Plan 9 from Bell Labs,’’ Computing Systems 8(3), pp. 221-254 (Summer 1995).
2. Dave Presotto and Phil Winterbottom, ‘‘The Organization of Networks in Plan 9,’’ Proceedings of the Winter 1993 USENIX Conference, San Diego, California, pp. 271-280 (1993).
3. Dave Presotto and Phil Winterbottom, ‘‘The IL Protocol,’’ Plan 9 Programmer’s Manual, Fourth Edition, Volume 2, Bell Labs, Lucent Technologies (2002).
4. Rob Pike, ‘‘8½, the Plan 9 Window System,’’ Proceedings of the Summer 1991 USENIX Conference, Nashville, pp. 257-265 (1991).
5. Rob Pike, ‘‘Acme: A User Interface for Programmers,’’ Proceedings of the Winter 1994 USENIX Conference, San Francisco, California, pp. 223-234 (1994).
6. Sean Quinlan and Sean Dorward, ‘‘Venti: a new approach to archival storage,’’ First USENIX Conference on File and Storage Technologies, Monterey, California (2002).
7. Sean Quinlan, Jim McKie, and Russ Cox, Fossil, an Archival File Server, Bell Labs, Lucent Technologies, unpublished memorandum (September 2003).
8. Sean Quinlan, ‘‘A Cached WORM File System,’’ Software: Practice and Experience 21(12), pp. 1289-1299 (1991).
9. Ken Thompson, ‘‘The Plan 9 File Server,’’ Plan 9 Programmer’s Manual, Fourth Edition, Volume 2, Bell Labs, Lucent Technologies (2002).
10. Rob Pike, ‘‘Plumbing and Other Utilities,’’ Proceedings of the USENIX Annual Conference, San Diego, California, pp. 159-170 (June 2000).
11. Ken Thompson, ‘‘Plan 9 C Compilers,’’ Proceedings of the Summer 1990 UKUUG Conference, London, pp. 41-51 (1990).
12. Phil Winterbottom, ‘‘Acid: A Debugger Built From A Language,’’ Proceedings of the Winter 1994 USENIX Conference, San Francisco, California, pp. 211-222 (1994).
13. Rob Pike, Dave Presotto, Ken Thompson, and Gerard Holzmann, ‘‘Process Sleep and Wakeup on a Shared-memory Multiprocessor,’’ Proceedings of the Spring 1991 EurOpen Conference, Tromsø, Norway, pp. 161-166 (1991).
14. Rob Pike, ‘‘Lexical File Names in Plan 9, or Getting Dot-Dot Right,’’ Proceedings of the USENIX Annual Conference, San Diego, California, pp. 85-92 (June 2000).
15. Pierre Jansen, Sape Mullender, Paul J M Havinga, and Hans Scholten, Lightweight EDF Scheduling with Deadline Inheritance, University of Twente (2003).
16. Russ Cox, Eric Grosse, Rob Pike, Dave Presotto, and Sean Quinlan, ‘‘Security in Plan 9,’’ Proceedings of the 11th USENIX Security Symposium, San Francisco, pp. 3-16 (2002).
17. Rob Pike and Ken Thompson, ‘‘Hello World, or Καλημέρα κόσμε, ...,’’ Proceedings of the Winter 1993 USENIX Conference, San Diego, pp. 43-50 (1993).
18. Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey, and Phil Winterbottom, ‘‘The Use of Name Spaces in Plan 9,’’ Proceedings of the 5th ACM SIGOPS European Workshop, Mont Saint-Michel (1992).
19. R Sandberg, D Goldberg, S Kleiman, D Walsh, and B Lyon, ‘‘Design and Implementation of the Sun Network File System,’’ Proceedings of the Summer 1985 USENIX Conference, Portland, Oregon, pp. 119-130 (June 1985).
20. Rob Pike and Dennis M Ritchie, ‘‘The Styx Architecture for Distributed Systems,’’ Bell Labs Technical Journal 4(2), pp. 146-152 (April-June 1999).
21. Rob Pike, ‘‘Graphics in Overlapping Bitmap Layers,’’ Transactions on Graphics 2(2), pp. 135-160 (1982).
22. Plan 9 Programmer’s Manual, Fourth Edition (Manual Pages), Bell Labs, Lucent Technologies (2002).