A Cached WORM File System

Sean Quinlan†
AT&T Bell Laboratories, Murray Hill, New Jersey 07974

ABSTRACT

This paper describes a general-purpose file system that uses a write-once-read-many (WORM) optical disk accessed via a magnetic disk cache. The cache enables blocks to be modified multiple times before they are written to the WORM and increases performance. Snapshots of the file system can be made at any time without limiting the users' access to files. These snapshots reside entirely on the WORM, are accessible to the user via a second read-only file system, do not contain multiple copies of unchanged data, and can be used to rebuild the file system in the event that the disk cache is destroyed. The file system has been implemented as part of Plan 9, an experimental operating system under development at AT&T Bell Laboratories.

Keywords: File systems, Caches, WORM, Backup

Introduction

Write-once optical disk technology is an excellent medium for data backup and archival. Write-once optical disks have a price per bit similar to magnetic tape, allow efficient random access to large volumes of data, and are more compact. The advent of jukeboxes, such as the fifty-disk Sony Writable Disk Auto Changer, now makes it feasible to hold hundreds of gigabytes on-line.

Write-once optical disks belong to the class of write-once-read-many (WORM) devices. A particular block on a WORM can be written only once but can be read many times. Conventional file systems are designed under the assumption that blocks on storage devices can be written more than once. This difference poses the question of how a WORM device can be used to advantage in a file system.

The File Motel [1] uses a WORM to store backup copies of the files of conventional file systems. A separate database, which resides on magnetic disk, is used to find the files on the WORM. The integrity of the WORM data is ensured by the ability to reconstruct this database if it is corrupted.
This approach works well, but a file system is not backed up in an atomic operation. Copying all files from the disk file system to the WORM requires a substantial period of time, and unless the disk file system is unmounted while this occurs, one can still modify files. The result is a backup that is "smeared" over time.

An alternative approach is to place a general-purpose file system directly on a WORM, taking into account the restriction that blocks can be written only once. One example is the Optical File Cabinet [2,3], which uses a block replacement strategy. The file system sees a logical address space that is much smaller than the size of the WORM device. Logical block addresses are mapped to physical WORM addresses; each time a logical block is written, the mapping is changed so that an unwritten physical block is used, i.e., a logical block can be written multiple times. The Optical File Cabinet has the advantage that the file system is unchanged and requires only a different device driver. However, a WORM block is allocated for every write operation, so the logical address space must be much smaller than the physical address space. This allocation can be reduced by the use of a cache, but the consumption of WORM blocks is still high. Optical disks are also considerably slower than magnetic disks, so the performance of the Optical File Cabinet compares poorly with that of a conventional file system.

This paper describes a novel approach for integrating a WORM device into a file system. As with the Optical File Cabinet, the file system resides on the WORM. However, the file system uses the address space of the WORM, carrying out reads and writes to WORM block addresses. A general-purpose file system is presented to the user; the write-once nature of the main storage medium is hidden by using a copy-on-write scheme.

† Current address: Computer Science Department, Stanford University (sean@neon.stanford.edu).
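The Optical File Cabinet's remapping strategy can be sketched as follows. This is a hypothetical illustration, not code from either system; the class and method names are invented:

```python
# Sketch of write-once block remapping: a small logical address space
# sits on a large WORM, and every write of a logical block consumes a
# fresh, previously unwritten physical block.

class RemappedWorm:
    def __init__(self, physical_size):
        self.physical = [None] * physical_size  # write-once physical blocks
        self.mapping = {}                       # logical -> physical address
        self.next_unwritten = 0                 # next physical block to consume

    def write(self, logical, data):
        if self.next_unwritten >= len(self.physical):
            raise IOError("WORM device is full")
        self.physical[self.next_unwritten] = data    # each physical block written once
        self.mapping[logical] = self.next_unwritten  # remap the logical block
        self.next_unwritten += 1

    def read(self, logical):
        return self.physical[self.mapping[logical]]
```

Rewriting one logical block twice consumes two physical blocks, which is why the logical address space must be much smaller than the physical one.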
When a block is written, the data is not transferred directly to the WORM; instead, it is cached on a magnetic disk, the WORM cache, with the result that certain blocks can be written multiple times and the consumption of WORM blocks is reduced. The cache also masks the relatively slow performance of the optical disk.

In addition, the backup system is unusual. At the user's request, the file system freezes activity and flushes the WORM cache, thus ensuring that all data is on the WORM. A consistent snapshot of the file system appears within seconds in a parallel read-only file system, which can be accessed using standard file utilities. From the user's perspective, the entire backup procedure takes a matter of seconds.

Block States

The WORM cache contains a subset of the blocks in the file system address space. Each block in the cache has an associated state, one of free, write, dump, or read. Blocks that are not in the cache are in the pseudo-state notfound, but conceptually they are read blocks. The states have the following properties:

free: Only in the cache; the block has not been written to the WORM. The block contains no data and is not in use in the file system.

write: Only in the cache. The block can be read and/or written: read-write.

dump: Only in the cache. The block is read-only.

read: On the WORM and in the cache. The block is read-only.

Blocks are created in the free state. The transition to the write state occurs when a block is incorporated into the file system tree and used to store data. A block can return to the free state if it is subsequently not needed, for example when a file is deleted. The dump command flushes the cache and results in a consistent and complete snapshot of the file system residing on the WORM. The transition from write to dump occurs when the dump command is executed, representing the semantic transition from read-write to read-only.
The dump state is only transient; the transition from dump to read occurs when the block is written to the WORM by a background daemon. Figure 1 gives the state transition diagram for a block.

[Figure 1. The state transition diagram for a block: free moves to write when the block is used to store data; write returns to free when the block is returned to the free list; write moves to dump on the dump command; dump moves to read when the block is written to the WORM.]

The File System

The file system provides a structure similar to the UNIX file system. There is a hierarchy of directories and ordinary files which are accessed by a set of system calls resembling open, close, read, write, etc. As in the UNIX file system, blocks are accessed via a pool of buffers that reside in memory and have the effect of caching both read and write operations. When a block is referenced that does not reside in the buffer pool, I/O operations are required. These I/O operations do not occur directly to the WORM, but instead take place via the WORM cache, which resides on a magnetic disk.

The WORM cache consists of blocks that are read-write or read-only depending on each block's current state. The file system thus consists of a mixture of both read-write and read-only blocks. In contrast, a conventional file system assumes that all blocks are read-write. The implementation of the file system hides the fact that certain blocks are read-only. The user sees a general-purpose file system that has no restrictions, other than the usual security restrictions, on which files may be modified. The WORM and the cache are treated as a special device, a cache WORM device, that contains both read-write and read-only blocks. A conventional file system was modified to handle the presence of read-only blocks by using a copy-on-write scheme. This scheme is described in the remainder of this section.

The file system consists of two elements: a set of files that are either directories or ordinary files, and a list of free blocks. The free list is discussed in the next section; for now, let us assume a free block can be obtained as needed.
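The block states and transitions of Figure 1 can be sketched as a small table-driven state machine. This is a hypothetical sketch (the real implementation is C inside the file server, and the event names here are invented):

```python
# Block states and the legal transitions between them, per Figure 1.
FREE, WRITE, DUMP, READ = "free", "write", "dump", "read"

TRANSITIONS = {
    (FREE, "use"):        WRITE,  # block incorporated into the file system tree
    (WRITE, "free"):      FREE,   # block returned to the free list (e.g. file deleted)
    (WRITE, "dump"):      DUMP,   # dump command: semantic shift to read-only
    (DUMP, "worm_write"): READ,   # background daemon writes the block to the WORM
}

def next_state(state, event):
    if (state, event) not in TRANSITIONS:
        raise ValueError("illegal transition: %s on %s" % (state, event))
    return TRANSITIONS[(state, event)]
```

Note that read has no outgoing transition: once a block reaches the WORM it is read-only forever.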
Directories are files of directory entries. A directory entry contains information about a file, such as its name, owner, and permissions, and a list of the block addresses containing data for the file. In our implementation, this list has a more complicated structure for random access to large files but, for the purpose of this discussion, we will consider the list to be a simple array of addresses. The extension to more complex structures is not difficult.

The directory hierarchy is a true tree, i.e., there are no hard links and no blocks are shared between files. A root block contains the directory entry for the root of the tree, and all blocks can trace a path from this root. The address of a block is stored in exactly one location, namely the directory entry of the file in which the block is contained. If a block is to be replaced by a different block, only this one address need be changed.

When a file is opened, the system walks the path from the root to the file. At each step in the walk, a current directory entry in either a dump or read block is moved to a write block. This move is achieved as follows:

- A free block is marked as write.
- The data in the old block is copied to the write block.
- The pointer to the old block, which resides in the parent directory, is changed to point to the write block.

The pointer can be changed because the root is always maintained in a write block, so, by induction, the parent directory is in a write block. The result is that the directory entry of every open file is contained in a write block and can thus be modified.

After a file is opened, a copy-on-write scheme is used to enable the file to be modified. If an attempt is made to write to either a dump or read block in the file, a write block is allocated and substituted using the same procedure as when the file was opened. This substitution is possible since the directory entry for the file is contained in a write block. Figure 2
gives an example of opening and then modifying the file /etc/passwd. A write block is allocated when an attempt is made to write to a dump or read block. The structure of the file system is used to determine the location of the pointer that needs to be changed to reflect the substitution of the write block. The pointer itself is guaranteed to reside in a write block.

[Figure 2. Opening then modifying /etc/passwd.]

The Dump Command

One of the main goals of this file system is to provide a convenient and efficient means of recording snapshots of the entire file system. These snapshots reside entirely on the WORM, which is both reliable and write-once; there is no need to back up the file system to an external medium. Moreover, as a snapshot resides on the same device as the file system, blocks can be shared between the two, and even with previous snapshots. This sharing reduces the storage requirements for snapshots of the entire file system to the equivalent of incremental backups.

To create a snapshot, the dump command is executed. The snapshot of the file system is accessed as a directory in a second read-only file system called the dump. The user can mount this second file system and access files in the standard manner, although write operations are not permitted. Utilities such as the UNIX commands cp, grep, diff, ls, etc., can be used to examine and restore information contained in the snapshots. The root of the dump file system contains a directory entry for each snapshot taken, so the user has a complete history of backups, available on-line and accessible in a familiar fashion. It should be noted that the original file permissions apply, so no special arrangements need to be made for access to the dump; one cannot use the dump to subvert security. However, the dump cannot be changed; if a snapshot is made at a time when a sensitive file is temporarily readable, the file will remain accessible forever.

The dump file system is read-only; the copy-on-write scheme used in the main file system is not needed as there are no write operations. It follows that the invariants used to implement copy-on-write can be relaxed. In particular, the directory entry for an open file is not moved to a write block, and the file system structure is a graph rather than a tree. A block that remains unchanged between dumps will appear more than once in the dump file system even though it appears only once on the WORM.

The dump occurs in three stages. First, user activity is frozen and the WORM cache is flushed. Second, the dump file system is modified to include a snapshot of the file system. Third, the file system is returned to a state from which normal operation can continue.

The file system is placed in a consistent state by completing all current file operations and freezing all further activity. Any dirty memory I/O buffers are written to the WORM cache and then the WORM cache is flushed. Flushing the WORM cache requires all write blocks to be written to the WORM, a time-consuming process. To hide this time from the user, write blocks are converted to the transient dump state; no WORM I/O is performed. This change of state represents the semantic transition from read-write to read-only. A background daemon, described below, writes dump blocks to the WORM, completing the final transition to the read state. Flushing the WORM cache has the effect of making all blocks in the file system read-only, i.e., a snapshot has been taken. The root of the file system provides an access point to the snapshot.
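The copy-on-write substitution used when opening and modifying a file can be sketched as follows. This is a hypothetical sketch: the Block and CacheWorm names are invented, and a directory entry is reduced to a plain list of child addresses:

```python
# Copy-on-write substitution: a read-only (dump or read) block is
# replaced by a fresh write block, and the single pointer to it, held
# in the parent directory entry, is repointed.

class Block:
    def __init__(self, addr, state, data=None):
        self.addr, self.state, self.data = addr, state, data

class CacheWorm:
    def __init__(self):
        self.blocks = {}
        self.next_addr = 1

    def alloc_write(self, data=None):
        b = Block(self.next_addr, "write", data)  # a free block marked write
        self.blocks[b.addr] = b
        self.next_addr += 1
        return b

    def substitute(self, parent_entry, index):
        old = self.blocks[parent_entry[index]]
        if old.state == "write":
            return old                    # already writable, nothing to do
        new = self.alloc_write(old.data)  # copy the data into the write block
        parent_entry[index] = new.addr    # change the one pointer to the block
        return new
```

Because each block address is stored in exactly one place, repointing that single parent entry is the whole substitution; the old read-only block is simply left behind for any earlier snapshot that still references it.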
A pointer to the root block is stored as a new entry in the root directory of the dump file system. The new entry has a unique name generated from the time that the dump command was issued. To create the entry, the dump file system must be modified, requiring the root of the dump file system to be moved to a write block. This modification is a special case in which the dump file system is not read-only, and it results in a new root address for the dump file system.

The final step is to return the file system to a state from which normal operation can continue. The root of the file system is moved to a write block. Also, the path from the root to each open file is walked and the associated directory entries are moved to write blocks. Modifications to the file system can now be handled as discussed in the previous section. Note that changes to the file system are only visible through the new root block; the file system remains unchanged when viewed through the old root. In effect, the snapshot and the file system will split apart as the file system is modified. Figure 3 illustrates what the file system might look like before and after a dump. The file /etc/passwd is assumed to be open.

The dump is performed as an atomic operation; no changes to the file system are permitted during the procedure. In our current configuration, the dump command takes about ten seconds. The file system is inaccessible for these ten seconds, but this is acceptable for us, especially as dumps are generally performed only at night. The majority of the time spent executing the dump command involves marking the write blocks as dump. The search of the WORM cache takes a fixed time determined by the size of the cache, currently 122 megabytes. The other substantial task of the dump command is rewalking the path for each open file. The rewalk is necessary to ensure that the minimum number of directory entries are in write blocks.
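The cache-flush stage of the dump, and the daemon's follow-up, can be sketched over a simple address-to-state map. The structure and function names here are hypothetical:

```python
# Stage one of the dump: convert every write block to the transient
# dump state.  No WORM I/O happens here, which is why the dump command
# itself completes in seconds.

def dump_command(cache):
    """cache maps WORM block address -> state string."""
    for addr, state in cache.items():
        if state == "write":
            cache[addr] = "dump"

def daemon_step(cache, worm):
    """Background daemon: copy one dump block to the WORM (modelled as
    a set of written addresses) and mark it read.  Returns the address
    copied, or None when the flush is complete."""
    for addr, state in cache.items():
        if state == "dump":
            worm.add(addr)
            cache[addr] = "read"
            return addr
    return None
```

Since the daemon only moves blocks from dump to read, a second dump command can safely run while dump blocks from the previous one are still in the cache.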
A large number of open files could increase the period in which the file system is inaccessible.

After the dump command, the WORM cache contains dump blocks that have not been written to the WORM. A background daemon copies dump blocks to the WORM and marks them as read. This copying is entirely transparent to the user. As the daemon simply converts blocks from the dump state to the read state, the dump command can be executed again while there are still dump blocks in the WORM cache.

[Figure 3. Before and after the dump command, assuming /etc/passwd is open.]

The time to write all dump blocks to the WORM is expected to be similar to that of an incremental backup of a regular file system to media of similar performance. First, the number of blocks to write is minimal, as any block that has not changed since the previous dump is shared by both snapshots and need not be written. Second, the background daemon competes for access to the WORM only when a miss occurs in the WORM cache, a rare event, and this conflict is resolved in a fair fashion to avoid starvation.

Recovery

The reliability of the WORM would be negated if the file system could not be recovered in the event that the WORM cache is destroyed. This system can recover from such a failure. If the WORM cache is destroyed, only modifications made after the most recently completed dump are lost.

When the dump command is executed, the address of the root of the file system and the address of the root of the dump file system change. The root addresses are stored with other information in the superblock.
Before all write blocks are marked dump, a superblock is allocated. The address of the new superblock is stored in the old superblock; the superblocks are thus linked together. The first superblock on the WORM is at a known address. The list of superblocks can be traversed until the last superblock is reached, detected by the presence of a link to an unwritten block. To restore the system, the WORM cache is initialized and both the superblock and the root of the file system are moved to write blocks.

There is one small complication for the case in which the WORM cache is destroyed while there are blocks in the dump state. In this case, the most recent dump has not been completely written to the WORM and a previous dump must be used instead.

The Free List

The free list is used to maintain a record of free blocks. The head of this free list is contained in the superblock. When a file is deleted, all write blocks in the file are marked free and added to the free list, which requires examining the state of each block. This examination imposes a small but noticeable overhead for large files.

The superblock contains the current size of the file system, i.e., the maximum block address, which represents the high water mark on the WORM. When the free list becomes empty, the high water mark is increased; in effect, a number of blocks are created. These new blocks are placed in the cache as free blocks and added to the free list. Increasing the size of the file system is limited by two factors. First, the file system cannot be larger than the total size of the WORM device; when this limit is reached, the file system is full. Second, the WORM cache can only contain a certain number of free, write, and dump blocks; when this limit is reached, the WORM cache must be flushed using the dump command.

The WORM Cache

All WORM I/O is performed through the WORM cache, which resides on a magnetic disk.
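The superblock-chain traversal used during recovery can be sketched as follows. The layout is hypothetical: the first superblock's address and the field names are invented for illustration:

```python
# Recovery walks the chain of superblocks on the WORM, starting from a
# known address, until a superblock's "next" link points at a block
# that has not yet been written.

FIRST_SUPERBLOCK = 0   # known WORM address of the first superblock

def find_last_superblock(worm_read):
    """worm_read(addr) returns a superblock dict, or None if the block
    at addr has not yet been written to the WORM."""
    addr = FIRST_SUPERBLOCK
    while True:
        sb = worm_read(addr)
        if worm_read(sb["next"]) is None:  # link points at an unwritten block,
            return sb                      # so sb is the most recent superblock
        addr = sb["next"]
```

The root addresses recorded in the superblock found this way are then used to rebuild the cache in write blocks.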
The cache has two purposes. First, write operations are only performed on blocks that reside in the cache and have not been written to the WORM; multiple writes to a fixed address are thus possible. Second, the WORM device that we use, a Sony WDD-2000, is considerably slower than a magnetic disk; caching WORM reads increases performance.

I/O operations are preceded by the translation of the WORM block address to a cache address, using the cache map, and the cache address is then used as the location for the I/O operation. In the case of a read operation, the block may not reside in the cache, causing a cache miss and a read from the WORM device.

The cache map consists of a table of cache entries, each of which contains a WORM address and a block state. A cache entry with a non-zero WORM address implies that the block resides in the cache and is in the given state. The data for the block is in a cache data block associated with the cache entry. The cache map resides on the disk together with the actual data blocks, and each block of the cache map contains an array of d cache entries, where d is constant. If there are w blocks in the cache map then there are wd cache entries, which correspond to wd cache data blocks, and w + wd blocks on the disk. For our implementation we use a 122-megabyte disk made up of 2K blocks, for which d = 255 and w = 243 are suitable values. The cache is d-way set associative: to translate a WORM address x, block (x mod w) of the cache map is searched.

A block in the free, write, or dump state is always maintained in the cache; hence the cache entries for such blocks are fixed. As the cache is set associative, the situation can arise in which all entries in a cache map block are fixed and no new blocks can be added. However, blocks are placed in the cache in only two situations: when creating a free block and when reading from the WORM.
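The set-associative translation can be sketched as follows, with w and d taken from the paper; the in-memory representation is a hypothetical stand-in for the on-disk cache map:

```python
# d-way set-associative cache map lookup: WORM address x is searched
# for in cache map block (x mod w), a linear search of up to d entries.

W, D = 243, 255   # cache map blocks, and entries per block

def lookup(cache_map, worm_addr):
    """cache_map is a list of W lists, each holding up to D
    (address, state) cache entries; return the block's state."""
    for addr, state in cache_map[worm_addr % W]:
        if addr == worm_addr:
            return state
    return "notfound"   # pseudo state: the block resides only on the WORM
```

Two WORM addresses that are congruent modulo w compete for the same cache map block, which is the source of the fixed-entry collision problem analysed below.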
In the first case, if the free block cannot be placed in the cache then it is ignored and, as a result, a WORM block is wasted. In the second case, if the block is not placed in the cache then a WORM read will be performed every time the block is accessed and the performance of the system will be reduced.

We would like a high probability that no cache map block will contain only fixed entries. To achieve this, the number of fixed entries is limited to some fraction α of the size of the cache. If we assume that such entries have unique addresses with a uniform probability distribution, then the probability that none of the cache map blocks is full of fixed entries is given by the function P(w, d, n), where:

w is the number of blocks in the cache map.
d is the number of entries in a block.
n is the number of fixed entries in the cache, i.e., n = αwd.

For w = 1,

    P(1, d, n) = 1 if n < d, and 0 if n ≥ d.

For w > 1,

    P(w, d, n) = Σ_{i=0}^{d−1} C(n, i) (1/w)^i (1 − 1/w)^{n−i} P(w − 1, d, n − i)

where C(n, i) is the binomial coefficient. For our implementation, with d = 255, w = 243, and α = 0.75, we get a value for P(w, d, αwd) of about 99.9%. Thus, if a dump is not performed until the fraction of fixed entries reaches α, we can still expect, with 99.9% probability, that no cache map block will contain only fixed entries.

In the rare case that a cache map block does become full of fixed entries, the creation of free blocks may fail. Every failure causes one block on the WORM to be wasted, since it is never added to the free list and hence never used. The expected number of allocation failures, E, is given by the formula

    E = [ Σ_{i=d+1}^{n} (i − d) C(n, i) (1/w)^i (1 − 1/w)^{n−i} ] / [ Σ_{i=d+1}^{n} C(n, i) (1/w)^i (1 − 1/w)^{n−i} ]

In our implementation E is about 4; thus we expect to waste about 4 blocks once every thousand dumps.
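The two formulas can be implemented directly, as a sketch for checking small parameter values; the naive recursion below is far too slow for the paper's w = 243, d = 255:

```python
# Direct implementations of P(w, d, n) and E from the collision
# analysis, usable only for small parameters.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def P(w, d, n):
    """Probability that no cache map block is full of fixed entries."""
    if w == 1:
        return 1.0 if n < d else 0.0
    # Condition on this block receiving i of the n entries (i < d),
    # then distribute the remaining n - i entries over w - 1 blocks.
    return sum(comb(n, i) * (1 / w) ** i * (1 - 1 / w) ** (n - i)
               * P(w - 1, d, n - i) for i in range(d))

def E(w, d, n):
    """Expected overflow of a cache map block, given it does overflow."""
    pmf = lambda i: comb(n, i) * (1 / w) ** i * (1 - 1 / w) ** (n - i)
    num = sum((i - d) * pmf(i) for i in range(d + 1, n + 1))
    den = sum(pmf(i) for i in range(d + 1, n + 1))
    return num / den
```

For example, P(2, 2, 3) is 0, as the pigeonhole principle demands: three entries in two blocks of two slots must fill one block.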
We consider this an acceptable rate, given that it allows a simple solution to the hash collision problem.

A round-robin algorithm is used to determine which of the possible entries in a cache map block is used when adding a block to the cache. The extra complexity and storage required to implement other algorithms, such as LRU, did not appear justified; however, future performance analysis of the file system may prove this judgment incorrect.

Finally, the overhead of accessing WORM blocks via the WORM cache is minor. The majority of accesses are expected to be satisfied by the I/O buffers, which act as a first-level cache in memory. For a WORM block, the WORM address is used to check these buffers and no examination of the WORM cache is needed. If a block is not in a buffer then the WORM cache is checked by searching one block of the cache map: a linear search of d entries. The cache map blocks are also accessed via the buffers and will generally reside in memory. A cache hit results in I/O to the WORM cache disk, while a cache miss requires a read from the WORM. In either case, the cost of the I/O should outweigh the cost of searching the WORM cache. Note that as the size of the WORM cache is increased, the value of d can be held constant, although α may need to be adjusted.

Implementation

The file system has been implemented on a VAX-750 used as a dedicated file server for Plan 9, an experimental operating system under development at AT&T Bell Laboratories. The implementation consists of about 7000 lines of C. The system is configured with 6 megabytes of memory for I/O buffers, 122 megabytes of disk for the WORM cache, and a single 1.5-gigabyte write-once optical disk (Sony WDD). In the near future the system will be moved to a MIPS 6280 with 100 megabytes of I/O buffers, 1 gigabyte of WORM cache, and a 300-gigabyte Sony Writable Disk Auto Changer (a jukebox). The jukebox will be regarded as a single device and accessed via the one file system.
This file system was not designed for the UNIX operating system; however, its structure is similar to the UNIX file system with the exception that hard links are not possible. This restriction is not severe, as soft links are generally an alternative and Plan 9 uses a different approach to file aliasing. The current file system can be accessed transparently as a remote file server from machines in our laboratory that run the UNIX operating system.

Future Work

One of the major issues that has not been addressed is how to deal with the finite size of the WORM device. Each time the dump command is executed, some WORM blocks are irrevocably written. A problem arises when the WORM becomes full. Our current solution is to assume the WORM has infinite size; we have yet to fill one of our optical disk jukeboxes. Over the three years that we have used optical disk jukeboxes, the density of the disks has quadrupled. Although we consume WORM disks at an exponential rate, the technology is improving even faster.

Even if the WORM does become full, it is expected that the size of the file system will still be considerably smaller than the size of the WORM. The WORM will contain many blocks that are not accessible via the file system, e.g., blocks from files that can only be accessed via the dump file system. It may be possible to move the file system to a second WORM medium, enabling the inaccessible blocks to be reused.

Conclusion

The described method of utilizing a WORM device is both general and efficient. Users are provided with a standard file system which can be larger than the available magnetic disk resources. Snapshots of the file system can be taken at regular intervals with almost no inconvenience to the user. The snapshots reside entirely on the WORM and are accessed as a regular, albeit read-only, file system. In the event that the WORM cache is destroyed, the file system can be restored to the state of a previous snapshot.
Finally, if the working set of the file system is smaller than the WORM cache, then little or no performance degradation is noticed by the user.

Acknowledgements

Ken Thompson suggested the idea of using the WORM address space for the file system and the algorithm for the dump command. He also provided much help with the implementation. This paper was improved by comments from Steve Bellovin, Tom London, Ross Quinlan, Ken Thompson, and two anonymous reviewers.

References

[1] Hume, A., The File Motel - An Incremental Backup System for Unix, Summer Usenix Conference Proceedings, 1988, pp. 61-70.
[2] Gait, J., The Optical File Cabinet: A Random Access File System for Write-Once Optical Disk, IEEE Computer, May 1988.
[3] Laskodi, T., Eifrig, B., Gait, J., A UNIX File System for a Write-Once Optical Disk, Summer Usenix Conference Proceedings, 1988, pp. 51-60.