Hierarchical File Systems are Dead

Margo Seltzer, Nicholas Murphy
Harvard School of Engineering and Applied Sciences

Abstract

For over forty years, we have assumed hierarchical file system namespaces. These namespaces were a rudimentary attempt at simple organization. As users have begun to interact with increasing amounts of data and are increasingly demanding search capability, such a simple hierarchical model has outlasted its usefulness. For this reason, we should design file systems whose organizations map to the ways we access and manipulate data now. We present a new file system architecture in which we replace the hierarchical namespace with a tagged, search-based one.

1 Introduction

Interaction with stable storage in the modern world is generally mediated by systems that fall roughly into one of two categories: a filesystem or a database. Databases assume as much as they can about the structure of the data they store. The type of any given piece of data is known (e.g., an integer, an identifier, text, etc.), and the relationships between data are well defined. The database is the all-knowing and exclusive arbiter of access to data. Unfortunately, if the user of the data wants more direct control over the data, a database is ill-suited. At the same time, it is unwieldy to interact directly with stable storage, so something light-weight in between a database and raw storage is needed.
Filesystems have traditionally played this role. They present a simple container abstraction for data (a file) that is opaque to the system, and they allow a simple organizational structure for those containers (a hierarchical directory structure). This weak set of assumptions about the structure of data has nonetheless allowed a variety of general-purpose tools to be created (e.g., ls, tar, etc.) that can operate on application data without knowing about its internals. These tools, in turn, facilitated easy construction and management of a range of applications employing stable storage, often enabling features that the applications do not include on their own (e.g., archiving in the case of the tar utility).

The situation, however, has evolved. In 1992, a "typical" disk was approximately 300 MB. In 2009, a typical disk is closer to 300 GB, representing a three-order-of-magnitude increase. While typical file sizes have also increased, they have not increased by the same margin. As a result, users may have many gigabytes' worth of photo, video, and audio libraries on a single PC. This situation represents a management nightmare, and mere hierarchical naming is ill-suited to the task. One might want to access a picture, for instance, based on who is in it, when it was taken, where it was taken, etc. Applications interacting with such libraries have evolved external tagging mechanisms to deal with this problem.

At the same time, use of the web is now ubiquitous, and "Google" is a verb. With the advent of search engines, users have learned to find data by describing what they want (e.g., various characteristics of a photo) instead of where it lives (i.e., the full pathname of the photo in the filesystem). This can be seen in the popularity of search as a modern desktop paradigm in such products as Windows Desktop Search (WDS) [26]; Mac OS X Spotlight [21], which fully integrates search with the Macintosh journaled HFS+ file system [7]; and the various desktop search engines for Linux [4, 27]. Indeed, Mac OS X in particular goes one step further and exports APIs that allow applications to directly access the meta-data store and content index.

It should be noted that while search is a database-like model for information retrieval, databases themselves tend to be too heavy-weight a solution. They may not be ideally optimized for a given application, and they prevent independent evolution of the data. In addition, most databases are painful to both install and manage.

Of course, along the way there have been sporadic efforts to merge the features of filesystems and databases. The file systems community began to adopt components of database technology: journaling (logging) [1, 2, 6], transactions [5, 12, 15, 17, 18], and B-trees [5, 12, 25]. Periodically, researchers suggested that we should begin thinking of our file systems more like databases. First, in 1991, Gifford et al. proposed the semantic file system, which allowed users to create virtual directories organized by attributes rather than by hierarchical names [19]. In 1993, Olson proposed implementing the file system on top of a database, providing query-based access to file system data [8]. But the hierarchical directory structure has always remained the central filesystem abstraction.

We believe it is time to finally get rid of the hierarchical namespace in favor of a more search-friendly storage system. The hierarchical directory model is an increasingly irrelevant historical relic, and its burial is overdue.

2 The Hierarchical Namespace Albatross

But what does our aging filesystem infrastructure cost? Is it such a burden? We argue that a hierarchical namespace suffers from the following problems:

2.1 Irrelevance

Users no longer know where their files are stored. Moreover, they tend to access their data in a more search-based fashion. We encourage the skeptical reader to ask non-technical friends where their email is physically located. Can even you, the technically savvy user, produce a pathname to your personal email? Your Quicken files? The last Word document you edited? The last program you ran? For that matter, how many files are on your desktop right now? How many of them are related to each other?

2.2 Restrictiveness

There is no single, canonical categorization of a piece of data, and imposing one in the form of a full pathname conflates naming (how a piece of data is identified) with access (how to retrieve that piece of data). Much as a single piece of clothing may belong to multiple outfits, a single piece of data may belong to multiple collections. Note that we are not condemning hierarchy in general; indeed, hierarchy can be useful in a variety of situations. Rather, we are arguing against canonizing any particular hierarchy. A data item may have many names, all equally useful and even equally used.

Note, for instance, that the Unix Fast File System [13] attempts to place items that are logically grouped in the same directory in the same (or a nearby) cylinder group. If those items are always accessed together, this arrangement works well, but what if the data are accessed in different ways, or access patterns evolve over time? MacCormack [11] noted, for instance, that while software architects frequently organize code into directories according to functionality, after a few years the software's actual structure and interactions do not match the architect's initial vision. Even with appropriate clustering, Stein [22] notes that modern storage systems often violate assumptions made by existing filesystems anyway, and thus any performance gains from such clustering may be illusory to begin with. His work focused on the effect of systems like SANs, but upcoming solid-state drives, for which sequential access may no longer be fastest, will have the same effect.

2.3 Performance Limiting

In 1981, Stonebraker noted that implementing a database on top of a file system adds a superfluous level of indirection to data access [23]. This has not changed. Consider the path between a search term and a data block in most systems today. First, we look up the search term in an indexing system, which is built on top of files in the file system. Translating from search term to the file in which it is found requires traversing two indices: the search index and the physical index (i.e., file system direct/indirect blocks). That search yields a file name. We then navigate the hierarchical namespace, which is another kind of index traversal and can require multiple subsequent traversals depending on the length of the path and the number of entries in each directory along the path. Finally, we reach the meta-data for the file we want and perform one last index traversal of the physical structure of that file. At a minimum, we encountered four index traversals; at a maximum, many more. Even if a system can hold all these indexes in memory, the multiple indexes place pressure on the processor caches.

Additionally, in today's multicore world, parallelism is becoming increasingly important, and a hierarchy imposes unnecessary concurrency bottlenecks.
For instance, the directories /home/nick and /home/margo are functionally unrelated most of the time, yet accessing them requires synchronizing read access through a shared ancestor directory. A file system hierarchy is a simple indexing structure with obvious hotspots; better indexing structures with fewer hotspots exist, so we should take advantage of them.

We therefore posit the following basis of a modern filesystem:

• Separate naming from access – The way one locates a piece of data should not be tied to how it is accessed, and multiple (potentially arbitrary) means of location should be supported.

• Data agnostic – Applications should be able to impose their own structure on the data items they store, and as such the filesystem should treat data items as opaque sets of bytes.

• Direct access to data – As the filesystem should be the thinnest possible veneer over storage that is still useful, it should allow direct access to data items. If an application knows exactly which data item it needs, it should be able to retrieve it directly.

• Backwards compatibility – With so much of the world currently built on top of hierarchical namespaces, a storage system is not useful without some support for backwards compatibility, in interface if not in disk layout.

3 hFAD: A New File System Architecture

We have developed a new file system architecture, hFAD (Hierarchical Filesystems Are Dead). hFAD eschews a hierarchical namespace, instead using a tagged, search-based namespace. We present the system architecture and a brief discussion of our implementation approach.

At its lowest level, hFAD resembles an object-based storage device (OSD) [14]. Storage objects have a unique ID, and higher layers of the system access these objects by their ID. Unlike traditional OSDs, our objects are fully byte-accessible: not only can you read bytes from an object, you can also insert bytes into the middle of an object, remove bytes from the middle, etc.

Internally, hFAD requires an indexing infrastructure that supports its novel, search-based API. The indexing structure contains an extensible collection of indices facilitating multiple naming modes and types of search. Figure 1 shows the hFAD architecture. In the following sections, we step through each of the major architectural components of the system and then conclude with a discussion of how we are implementing them.

Figure 1: hFAD architecture. Index stores (e.g., image, full-text, and tag indices) combined with arbitrary-length extents in the OSD provide the primary means of accessing stable storage. A POSIX interface can easily be implemented on top of these services.

3.1 API

There are two main components to the native hFAD API. The naming interfaces map tagged search terms to objects. The access interfaces manipulate an object once it has been located.

3.1.1 Naming Interfaces

An object is named by one or more tag/value pairs. A tag tells hFAD how to interpret the value and in which of multiple indexes to search for the value. For example, we support POSIX naming as a thin layer atop the native API. A naming operation on POSIX path P translates into a lookup on the tag/value pair POSIX/P. Note that a POSIX path is simply one name among many possible names. A full-text search on search terms S1, S2, ..., Sn translates into a naming operation on the vector of tag/value pairs FULLTEXT/S1, FULLTEXT/S2, etc. As one would expect, the result of such an operation is the conjunction of the results of an index lookup for each element in the vector. Naming operations can return multiple items (which are returned in an unspecified order). Moreover, no query need uniquely define a data item. Only the identifier for the data in the OSD layer must be unique.
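To make the naming interface concrete, the following C sketch shows what a conjunctive tag/value lookup might look like. It is a minimal illustration under our own assumptions: the type and function names (hfad_tagval, hfad_lookup) and the signature are hypothetical, since the paper describes the semantics of naming operations but not their prototypes.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical sketch of the hFAD naming interface; names and signatures
     * are illustrative, not the authors' actual API. */
    typedef struct {
        const char *tag;    /* e.g., "POSIX", "FULLTEXT", "USER", "UDEF", "APP", "ID" */
        const char *value;  /* e.g., a pathname, a search term, a login name */
    } hfad_tagval;

    /*
     * Resolve the conjunction of 'nterms' tag/value pairs to matching object
     * IDs.  Up to 'max_oids' matches are written to 'oids' in no particular
     * order; '*found' receives the number of matches.  A query need not
     * identify a unique object -- only OIDs themselves are unique.
     */
    int hfad_lookup(const hfad_tagval *terms, size_t nterms,
                    uint64_t *oids, size_t max_oids, size_t *found);

Under this sketch, a POSIX open of a path (say, /home/nick/notes.txt, a made-up example) would become a single-term lookup on the pair ("POSIX", "/home/nick/notes.txt"), while a keyword search on two terms would become a two-term lookup on ("FULLTEXT", term1) and ("FULLTEXT", term2), whose result is the intersection of the two index lookups.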
3.1.2 Access Interfaces

The access interfaces support reading and writing as standard filesystems do, but due to our implementation (see Section 3.4) we can also easily support insertion and removal operations, enabling inserts into the middle of objects and truncates from anywhere in a file. To support this feature, we add an insert and a truncate call. The read and write calls are compatible with POSIX for ease of supporting legacy applications. The insert call takes arguments identical to the write call, but instead of overwriting bytes in the middle of a file, it inserts those bytes at the appropriate position, growing the file by the number of bytes being inserted. Both insert and write can be used to append data to a file. Truncation is similarly extended. While the POSIX truncate takes a single off_t giving the new length of the file (so bytes can only be removed from the end), the hFAD truncate takes two off_t's, an offset and a length, indicating exactly which bytes to remove from the file.
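As a rough illustration of how these calls might extend the familiar POSIX-style interface, here is a hedged C sketch. The function names and exact prototypes are assumptions of ours; the paper specifies only the semantics (insert grows the object in place, truncate removes an arbitrary byte range).

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/types.h>   /* off_t, ssize_t */

    /* Illustrative prototypes only; not the authors' actual API. */

    /* read/write mirror their POSIX counterparts (pread/pwrite-style). */
    ssize_t hfad_read(uint64_t oid, void *buf, size_t len, off_t off);
    ssize_t hfad_write(uint64_t oid, const void *buf, size_t len, off_t off);

    /* Same arguments as write, but existing bytes at 'off' are shifted up
     * rather than overwritten, so the object grows by 'len' bytes. */
    ssize_t hfad_insert(uint64_t oid, const void *buf, size_t len, off_t off);

    /* Unlike POSIX truncate(2), which can only cut a file off at a given
     * length, this removes 'len' bytes starting at 'off', anywhere in the
     * object. */
    int hfad_truncate(uint64_t oid, off_t off, off_t len);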
3.2 Index Stores

We specify an extensible index store to facilitate efficient search on rich data types. Given one or more tag/value specifications, the collection of index stores must return a list of object IDs matching the search terms. Just as the database community is moving away from a one-size-fits-all approach to data management [24], we believe that efficient access requires multiple indexing approaches. For example, a key/value store suffices for simple attributes, but not for full text, and neither a full-text index nor a key/value store is likely to be suitable for image indexing.

In earlier work, we discovered that tagging indexed data provides valuable information to help in query processing [3]. Building on this experience, we have identified several types of tags we believe will expedite file search as well. Table 1 shows how different use cases will use different types of tags in their requests.

Table 1: Tag/value pairs for different API uses.

    Use           Tag        Value
    POSIX         POSIX      pathname
    Search        FULLTEXT   term
    Manual        USER       logname
                  UDEF       annotations
    Applications  APP        application name
                  USER       logname
    FastPath      ID         object identifier

Callers into the native API use different tags to identify different kinds of values. For example, a regular keyword search specifies the FULLTEXT tag for all search terms. Applications tag items with the application name and the user who ran the application. Users can also add tags manually using the UDEF tag.

The index store layer is designed to support multiple indices. While keywords and pathnames might easily be stored in the same index, we want to leave open the possibility of extending hFAD with arbitrary index types, such as indices on images, sound, etc. A special tag, ID, indicates that the value is actually a unique object ID, supporting object reference caching inside applications.

3.3 OSD Layer

The storage layer is responsible for presenting the abstraction of a uniquely identified container of bytes. Each such container (object) has associated meta-data identifying the object's security attributes, its last access and modified times, and its size. The OSD layer is comparable to the ZFS Data Management Unit (DMU) [20], although the DMU provides collections of objects (objsets) as an abstraction and we do not. In ZFS, the DMU is a transactional object store; in hFAD, the OSD may be transactional, but this is an implementation decision, not a requirement.

3.4 Implementation

While we presented our architecture top-down, the implementation is easiest to understand bottom-up. We use Linux/FUSE to implement our hFAD prototype. The lowest layer of the OSD is a buddy storage allocator [9]. We have ported both Berkeley DB [16] and Lucene [10] to sit atop the raw device and the storage allocator.

We allocate objects into variable-sized extents. We represent objects in the OSD as Berkeley DB (BDB) B-tree databases whose keys are the file offsets at which extents begin and whose data items are the disk addresses and lengths corresponding to those offsets. As noted above, the use of B-trees gives us the capability to insert and truncate with little implementation effort. We use a NULL key value in the B-tree to store the meta-data associated with an object. Indeed, POSIX metadata can easily be stored in a similar fashion, as a unique key (or set of unique keys) in a file's B-tree. Directories potentially map nicely onto B-trees as well.

We also use BDB B-trees to map unique object IDs (OIDs) to the meta-data for an object, and for all string indexes other than full-text search. We use Lucene for the full-text search indices, with background threads performing lazy full-text indexing. Finally, the APIs are implemented as a thin layer on top of the index stores.
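To make the extent representation above concrete, here is a small, self-contained C sketch of the offset-to-disk-address mapping it implies. It stands in for what a Berkeley DB B-tree keyed by extent start offsets provides; the structure and function names, and the sample numbers, are ours for illustration and are not code from the hFAD prototype.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* One extent-map entry: the B-tree key is the offset in the object at
     * which the extent begins; the data are the extent's disk address and
     * length. */
    typedef struct {
        uint64_t obj_off;    /* key: object offset where this extent starts */
        uint64_t disk_addr;  /* data: starting disk address */
        uint64_t length;     /* data: extent length in bytes */
    } extent;

    /*
     * Map a byte offset within an object to a disk address: find the extent
     * with the greatest starting offset <= off (what a range-search cursor
     * over the B-tree would return), then add the offset's distance into
     * that extent.  Returns 0 for a hole or an out-of-range offset.
     */
    static uint64_t resolve(const extent *map, size_t n, uint64_t off)
    {
        for (size_t i = n; i-- > 0; ) {
            if (map[i].obj_off <= off) {
                uint64_t delta = off - map[i].obj_off;
                return delta < map[i].length ? map[i].disk_addr + delta : 0;
            }
        }
        return 0;
    }

    int main(void)
    {
        /* Two extents: object bytes [0, 4096) at disk address 1000000 and
         * bytes [4096, 12288) at disk address 5000000.  Inserting bytes in
         * the middle of the object then only requires splitting one extent
         * and re-keying the entries that follow it, not rewriting the
         * object's data. */
        extent map[] = {
            { 0,    1000000, 4096 },
            { 4096, 5000000, 8192 },
        };

        printf("object offset 5000 -> disk address %llu\n",
               (unsigned long long)resolve(map, 2, 5000));
        return 0;
    }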
4 Open Questions

Obviously, the description of hFAD above is incomplete, and there are a variety of open research questions associated with this work. Among them:

• Should hFAD support arbitrary types of indexing through, for example, a plug-in model? If so, how?

• Could or should we employ ideas from the semantic filesystem work to extend the notion of a "current directory" to an iterative refinement of a search?

• How much should the index stores do? Should they support arbitrary boolean queries? Should they include full-fledged query optimizers? How much control should they expose to filesystem clients?

5 Conclusions

We have presented a file system architecture that eschews the hierarchical namespace in favor of a tagged, search-based one. As user expectations move increasingly away from organized data to search-based data, this architecture provides a more natural way for users to manage and access data. We encourage other researchers to design their own implementations, and we look forward to comparisons of different implementations and of how well these new systems work relative to historical practice.

References

[1] JFS. http://oss.software.ibm.com/jfs.

[2] Veritas SANPoint Direct File Access.

[3] Layering in provenance systems. In Proceedings of the 2009 USENIX Annual Technical Conference (San Diego, CA, June 2009), USENIX Association.

[4] The Beagle project. http://beagle-project.org/.

[5] CHISNALL, D. ZFS uncovered. http://www.informit.com/articles/article.aspx?p=712746, April 2007.

[6] GALLI, R. Journal file systems in Linux. pp. 50–56.

[7] HFS Plus volume format. http://developer.apple.com/technotes/tn/tn1150.html.

[8] OLSON, M. The design and implementation of the Inversion file system. In Proceedings of the 1993 USENIX Winter Conference (1993), pp. 205–218.

[9] KNUTH, D. The Art of Computer Programming, Volume 1: Fundamental Algorithms. Addison-Wesley, Reading, MA.

[10] Lucene. http://lucene.apache.org/.

[11] MACCORMACK, A., RUSNAK, J., AND BALDWIN, C. Y. The impact of component modularization on design evolution: Evidence from the software industry.

[12] MACDONALD, J. Reiser4 transaction design document. http://lwn.net/2001/1108/a/reiser4-transaction.php3.

[13] MCKUSICK, M. K., JOY, W. N., LEFFLER, S. J., AND FABRY, R. S. A fast file system for UNIX. ACM Transactions on Computer Systems 2 (1984), 181–197.

[14] NAGLE, D., FACTOR, M. E., IREN, S., NAOR, D., RIEDEL, E., RODEH, O., AND SATRAN, J. The ANSI T10 object-based storage standard and current implementation. IBM Journal of Research and Development 52, 4 (July 2008), 401–411.

[15] OLSON, J. Enhance your apps with file system transactions. http://msdn.microsoft.com/en-us/magazine/cc163388.aspx.

[16] OLSON, M. A., BOSTIC, K., AND SELTZER, M. I. Berkeley DB. In USENIX Annual Technical Conference, FREENIX Track (Monterey, CA, June 1999).

[17] SCHMUCK, F., AND WYLIE, J. Experience with transactions in QuickSilver. ACM SIGOPS Operating Systems Review 25, 5 (October 1991), 239–253.

[18] SELTZER, M., AND STONEBRAKER, M. Transaction support in read optimized and write optimized file systems. In Proceedings of the 16th International Conference on Very Large Data Bases (1990), Morgan Kaufmann, pp. 174–185.

[19] GIFFORD, D., ET AL. Semantic file systems. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (October 1991), pp. 16–25.

[20] SOLARIS. ZFS source tour. http://opensolaris.org/os/community/zfs/source/#DMU.

[21] Working with Spotlight. http://developer.apple.com/macosx/spotlight.html.

[22] STEIN, L. Stupid file systems are better. In HOTOS '05: Proceedings of the 10th Conference on Hot Topics in Operating Systems (Berkeley, CA, USA, 2005), USENIX Association, pp. 5–5.

[23] STONEBRAKER, M. Operating system support for database management. Commun. ACM 24, 7 (1981), 412–418.

[24] STONEBRAKER, M., AND CETINTEMEL, U. "One size fits all": An idea whose time has come and gone. In Proceedings of the 21st International Conference on Data Engineering (Tokyo, Japan, April 2005), IEEE Computer Society Press.

[25] SWEENEY, A., DOUCETTE, D., HU, W., ANDERSON, C., NISHIMOTO, M., AND PECK, G. Scalability in the XFS file system. In Proceedings of the 1996 USENIX Annual Technical Conference (1996), pp. 1–14.

[26] Windows search. http://www.microsoft.com/windows/products/winfamily/desktopsearch/default.mspx.

[27] The Xapian project. http://www.xapian.org/.