Venti Ohad Rodeh Venti – p.1/37 Introduction This lecture is based on: Venti: a new approach to archival storage. Sean Quinlan and Sean Dorward. Bell Labs, Lucent Technologies. FAST 2002 Conference on File and Storage Technologies. Venti was the first to introduce content-based hashing for archival storage purposes. This approach has become accepted practice since. Venti – p.2/37 Motivation Archival storage is a second class citizen. Many systems do not allow access to previous versions of files or databases. Systems like AFS and WAFL do allow it, but only to a limited number of snapshots. Data is not kept in perpetuity. Common practice is to use an additional system, tape, for backup. But then, access to backup data is tedious. Venti – p.3/37 Tape setup Tape systems are used as a form of second-level storage Typically: a tape backup system serves several client machines Backup software on the clients interfaces with the tape, reads the contents of the databases and file-systems and decides what to backup. Data is copied over the network to the tape system Venti – p.4/37 Tape backup Restoring data from backup can be tedious and error prone The backup system violates the access permission of the file-system requiring a system-administrator or privileged software Restore operations are infrequent, so problems may go undetected Potential problems: tapes are mislabeled, reused, lost, drives wander out of alignment, technology becomes obsolete. Venti – p.5/37 Trade-off There is a trade-off between backup and restore Full backup is expensive but provides simple restore. Normally, incremental backup is done. Complicates restore. Venti – p.6/37 Main observations The growth in capacity of disk storage allows all data to be kept on disk, online. Use a write-one policy. Never erase data. Obviously, some data is too large too be kept, but once it is decided to keep it, it is never erased. Eliminates the tedious task of deciding what to erase. Simplifies the system because it does not need to overwrite nor delete data. Venti – p.7/37 The Venti archival server Venti is a block-level network storage system. It is not a database or a file-system. It provides a block-based back-end for storage applications. Venti exports a simple interface: read/write variable sized blocks of data. Venti – p.8/37 Unique hashes Venti identifies data blocks by a hash of their contents. By using a collision resistant hash function with a sufficiently large output, it is possible to consider the hash of a data block as unique. The hash is the fingerprint of the block The hash is then used as the address of the block for read/write operations This approach results in a storage system with some interesting properties Venti – p.9/37 Hashes as addresses As blocks are addressed by their hash, a block cannot be modified without changing its address Intrinsically write-once behavior In most other storage systems the address of a block never changes Venti – p.10/37 Backup behavior with Venti Multiple writes of the same data can be coalesced and do not require additional storage space. This simplifies backup behavior A client can perform a full-backup, Venti will eliminate redundancy between the new data and old data. No space will be wasted by moving from incremental to full backup. Furthermore, data from different applications can also be eliminated. Venti – p.11/37 Hashes as addresses II The hash function can be viewed as generating a universal name space for data blocks Without cooperating or coordinating multiple clients can share the name-space of a Venti server The low-level block interface places few restrictions on applications Any application that uses a disk can use Venti Traditional backup systems require more control. For example, they need to crawl over the database or file-system and differentiate new data from old-data in order to perform incremental backup. Venti – p.12/37 Hashes as addresses III The hash of a block is also used for integrity checking Since the contents of a block are immutable, the problem of data coherency is greatly reduced. A cache or a mirror cannot contain stale data Venti – p.13/37 Choice of hash function Venti requires a hash function that generates a unique fingerprint for every block In practice, this is done using a cryptographic hash Venti uses Sha1 which has an output of 160bits Probability of a collision: 1. n is the number of blocks 2. b is the number of bits in the hash: p≤ n(n−1) 2 × 1 2b Venti – p.14/37 Likelihood of a collision In a system with 1018 bytes of data stored as 8KB blocks the probability of a collision is about 10−20 This is sufficiently unlikely to be ignored In the future, larger hashes can be used. Venti – p.15/37 Using Venti by an application Venti poses a unique challenge to an application Writing is performed by sending data blocks to the Venti server In order to read a block of data, its fingerprint must be provided by the application. This requires the application to store block fingerprints Venti – p.16/37 Tree of fingerprints Store the tree of fingerprints on the server. Venti – p.17/37 Vac is like tar or zip, an application that stores many files and directories as one object. Vac The contents of files are stored as a tree of blocks The root fingerprint is stored at the file (*.vac) specified by the user The fingerprint is stored in ascii format plus a header for a total of 45 bytes This makes it look as if the entire archive takes up 45 byte on disk Venti – p.18/37 Vac II Vac stores each file as a separate tree of blocks This ensures that duplicate copies of a file will be coalesced on the server If multiple users vac the same data, it will be stored on the server exactly once. Repeatedly vac-ing the same directory will not use up more storage Even if the directory changes, only the changes take up additional space Venti – p.19/37 The plan-9 file-system Plan-9 is a Unix-like operating system from Bell labs It can be downloaded from the bell-labs site It has a file system that supports snapshots Previously, the plan9-FS was stored on a combination of magnetic disks and write-once optical jukebox Venti – p.20/37 The plan-9 file-system II The jukebox provides permanent storage The disks act as a writeable-cache for the jukebox The cache provides faster access and also accumulates changes between snapshots When a snapshot is taken, all modified blocks are written to permanent storage Venti – p.21/37 The Plan-9 file-system III The cache can be smaller than the active file-system However, accesses that miss the cache are significantly slower since changing platters in the jukebox takes seconds This performance penalty makes certain operations on backup storage prohibitively expensive Also, when the cache is reinitialized due to corruption, the file-server spends several days filling the cache before performance returns to normal Venti – p.22/37 Plan9-FS based on Venti The new version of Plan9-FS uses Venti as its back-end instead of an optical jukebox This simplifies things because Venti has the same latency as a disk Venti – p.23/37 Implementation An append-only log for data blocks An index that maps fingerprints to locations in the log Venti – p.24/37 The log Since Venti is intended for archival storage, robustness is particularly important The log is placed on a RAID array to protect against disk errors The log is append only, data is never overwritten nor erased The log is separated into arenas Data is compressed before being inserted into an arena Venti – p.25/37 The log II Venti – p.26/37 The index The index is implemented by a disk resident hash-table The index is divided into fixed-sized buckets Each bucket is stored as a single disk-block Each bucket contains the index-map for a small section of the fingerprint space A hash function is used to map fingerprints into buckets in a relatively uniform fashion. Binary search is used inside a bucket This structure provides an almost always one disk access per lookup. Venti – p.27/37 The index II Venti – p.28/37 Performance issues The main disadvantage of Venti compared to a standard block-based system is the need to go through the index Three techniques are used to offset this disadvantage: caching, striping, and write buffering. Venti – p.29/37 Caching One cache for the index, another for data blocks. First, the data cache is checked. Second, the index cache. Caching, however, does not help writing new data much. First, Venti needs to check if the block already exists. Since the block is new, it is obviously not in cache. Since the fingerprint is essentially random, it will most likely miss the index cache. Therefore, the write performance will be limited to the random IO performance of the index disk. Venti – p.30/37 Hardware The prototype Venti server is implemented with 1. Plan 9 operating system 2. 10000 lines of C code 3. Dual 550Mhz Pentium III 4. 2GB of memory 5. 100Mbit/sec Ethernet 6. The data log is stored on 500GB MaxTronic IDE RAID-5 array 7. The index resides on a string of 8 Seagate Cheetah 18XL 9GB SCSI disks Venti – p.31/37 Base performance The main problem occurs when performing un-cached reads Venti – p.32/37 Historical data Two file servers: Bootes and Emelie Boots: 1990-1997, block-size 6KB Emelie: 1998-2001, block-size 16KB Total of 522 user accounts, 50-100 active at any one time Large data sets: astronomical data, chess end games, etc. Venti – p.33/37 File servers Venti – p.34/37 De-duplication Venti – p.35/37 Reliability and Recovery Part of Venti’s charter is to build a reliable permanent store Special tools were built to check for integrity and recover from corruption Examples: 1. Verifying the structure of an arena 2. Checking that there is an index entry for each data block These tools directly access the storage and are executed on the server Venti – p.36/37 Summary Venti introduced the concept of content-addressable storage Now a fully established concept EMC sells the Centera product Venti – p.37/37