Fast Incremental Indexing for Full-Text Information Retrieval*

Eric W. Brown (brown@cs.umass.edu), James P. Callan (callan@cs.umass.edu), W. Bruce Croft (croft@cs.umass.edu)
Department of Computer Science, University of Massachusetts, Amherst, MA 01003 USA

Abstract

Full-text information retrieval systems have traditionally been designed for archival environments. They often provide little or no support for adding new documents to an existing document collection, requiring instead that the entire collection be re-indexed. Modern applications, such as information filtering, operate in dynamic environments that require frequent additions to document collections. We provide this ability using a traditional inverted file index built on top of a persistent object store. The data management facilities of the persistent object store are used to produce efficient incremental update of the inverted lists. We describe our system and present experimental results showing superior incremental indexing and competitive query processing performance.

Keywords: full-text document retrieval, incremental indexing, persistent object store, performance

1 Introduction

Full-text information retrieval (IR) systems are well established tools for satisfying a user's information need when the information base is a relatively static collection of documents. However, modern information management systems must be able to handle a steady influx of new information. Applications such as information filtering and daily news feed services are constantly processing new documents. If IR systems are to support such applications, they must be able to manage continually growing document collections.

A prerequisite to supporting a growing document collection is the ability to update the data structures used to index the collection. An indexing structure used by many IR systems is the inverted file index [SM83, Fal85, HFBYL92]. An inverted file index consists of a record, or inverted list, for each term that appears in the document collection. A term's inverted list stores a document identifier and weight for every document in which the term appears. The weight might simply be the number of times the term appears in the document, or a more sophisticated measure of the significance of the term's appearance in the document. Additionally, the location of each occurrence of the term in the document may be stored in order to support queries based on the relative positions of terms within documents.

When a batch of new documents is added to an existing document collection, a small number of the terms in the batch will be new to the collection, while the majority of the terms in the batch will already have inverted lists in the index. These lists must be updated by appending to them the term occurrences found in the new documents. This task is made difficult by the size characteristics of inverted lists and the techniques used to manage them in traditional IR systems. The inverted lists for a multi-gigabyte document collection will range in size from a few bytes to millions of bytes, and they are typically laid out contiguously in a flat inverted file with no gaps between the lists.

* This work is supported by the National Science Foundation Center for Intelligent Information Retrieval at the University of Massachusetts.
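To make the preceding description concrete, the sketch below shows one plausible, uncompressed in-memory layout for an inverted list. It is illustrative only; the names are our own, and INQUERY's actual lists are stored compressed (see Section 4.5).

    /* Illustrative, uncompressed layout of an inverted list. Field names
     * are hypothetical; the on-disk format is compressed (Section 4.5). */
    #include <stdint.h>

    typedef struct {
        uint32_t  doc_id;     /* document in which the term appears       */
        uint32_t  weight;     /* e.g., occurrence count in the document   */
        uint32_t  num_locs;   /* number of recorded occurrence positions  */
        uint32_t *locations;  /* position of each occurrence              */
    } PostingEntry;

    typedef struct {
        uint32_t      term_id;
        uint32_t      num_entries;  /* one entry per document, sorted by doc_id */
        PostingEntry *entries;
    } InvertedList;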
Adding to inverted lists stored in such a fashion requires expensive relocation of growing lists and careful management of free space in the inverted file. Rather than update existing inverted lists when adding new documents, many IR systems simply rebuild the inverted file by adding the new documents to the existing collection and indexing the entire collection from scratch. This technique is expensive in terms of time and disk space, resulting in update costs proportional to the size of the total collection after the addition. Instead, we would prefer the cost of the update to be proportional to the size of the new documents being added.

The INQUERY full-text information retrieval system [TC91, CCH92] provides this desirable update performance using the Mneme persistent object store [Mos90] to manage its inverted file index [BCCM94]. The key to providing fast incremental indexing is a unique inverted file structure made possible by the data management facilities of the persistent object store. Inverted lists are allocated in fixed size objects with a finite range of sizes, limiting the number of relocations a growing list will experience. When an inverted list has exceeded the largest object size, additional large objects are allocated and chained together in a linked list. Experimental results show that our system is able to provide superior incremental indexing performance in terms of both time and space, with only a small impact on query processing.

In the next section we briefly describe the main architectural features of INQUERY and Mneme. In Section 3 we discuss our inverted file structure in detail, including the motivation for the design. We present experimental results in Section 4 comparing the performance of our system with traditional techniques. In Section 5 we discuss related work, and we offer concluding remarks in Section 6. Contributions of our work include a technique for supporting incremental update of inverted lists in full-text information retrieval systems, along with an empirical analysis of a working implementation. Also, we continue to demonstrate that data management facilities for IR systems need not be custom built to obtain superior performance. Rather, IR systems can be effectively supported using appropriate "off-the-shelf" data management software.

2 Architecture

In this section we highlight some of the basic features of INQUERY and Mneme that are relevant to this work. Throughout the paper we will refer to the system as it is described here as the "old" version.

2.1 INQUERY

INQUERY is a probabilistic information retrieval system based upon a Bayesian inference network model [TC91, CCH92]. The power of the inference network model is the consistent formalism it provides for reasoning about evidence of differing types, allowing multiple retrieval models, document representations, and query representations to be combined simultaneously. Extensive testing has shown INQUERY to be one of the best IR systems, as measured by the standard IR metrics of recall and precision [Har94, TC92]. INQUERY is fast, scales well to large document collections, and can be embedded in specialized applications.

In INQUERY, document retrieval is accomplished by combining evidence from the document collection with evidence from the query to produce a ranking of the documents in the collection. The evidence for a document collection is pre-computed and stored as the weights and locations in an inverted file index. During retrieval, the inverted list for each term in the query is accessed and the evidence in the lists is accumulated and combined as dictated by the operators in the query. The inverted list for a term is obtained by looking up the term in the term dictionary. The term dictionary is built as a hash table with open-chaining conflict resolution. A term's entry in the dictionary contains collection statistics for the term and a reference to the term's inverted list. Inverted lists are stored as Mneme objects, where a single object of the exact size is allocated for each inverted list.
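The dictionary organization just described can be pictured as follows. This is a minimal sketch: the field names, hash function, and lookup routine are our assumptions, not INQUERY's actual code.

    /* Minimal sketch of the term dictionary: a hash table with
     * open-chaining conflict resolution. All names are assumed. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct DictEntry {
        char             *term;         /* stemmed term string (hash key)     */
        uint32_t          term_id;
        uint32_t          doc_count;    /* collection statistics for the term */
        uint32_t          coll_freq;
        uint32_t          list_obj_id;  /* reference to the inverted list object */
        struct DictEntry *next;         /* open chaining within a bucket      */
    } DictEntry;

    DictEntry *dict_lookup(DictEntry **buckets, size_t nbuckets,
                           const char *term) {
        size_t h = 0;
        for (const char *p = term; *p; p++)     /* any string hash would do */
            h = h * 31 + (unsigned char)*p;
        for (DictEntry *e = buckets[h % nbuckets]; e != NULL; e = e->next)
            if (strcmp(e->term, term) == 0)
                return e;
        return NULL;    /* caller inserts a new entry for a new term */
    }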
2.2 Mneme

The Mneme persistent object store [Mos90] was designed to be efficient and extensible. The basic services provided by Mneme are storage and retrieval of objects, where an object is a chunk of contiguous bytes that has been assigned a unique identifier. Mneme has no notion of type or class for objects. The only structure Mneme is aware of is that objects may contain the identifiers of other objects, resulting in inter-object references.

Objects are grouped into files supported by the operating system. Within files, they are physically grouped into physical segments. A physical segment is the unit of transfer between disk and main memory and is of arbitrary size. Objects are also logically grouped into pools, where a pool defines a number of management policies for the objects contained in the pool, such as how large the physical segments are, how the objects are laid out in a physical segment, how objects are located within a file, and how objects are created. Object format is determined by the pool, allowing objects to be stored in the format required by the application that uses the objects (modulo any translation that may be required for persistent storage, such as conversion of main memory pointers to object identifiers). Pools provide the primary extensibility mechanism in Mneme. By implementing new pool routines, the system can be significantly customized. The base system provides a number of fundamental mechanisms and tools for building pool routines, including a suite of standard pool routines for file and auxiliary table management.

Support for sophisticated buffer management is provided by an extensible buffering mechanism. Buffers may be defined by supplying a number of standard buffer operations (e.g., allocate and free) in a system defined format. How these operations are implemented determines the policies used to manage the buffer.
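One way to picture a pool is as a bundle of policy routines selected per group of objects. The following sketch is our own illustration of that idea and does not reproduce Mneme's actual interface.

    /* Illustrative rendering of a pool as a set of policy routines.
     * This is not Mneme's real API, only a picture of the concept. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        size_t   segment_size;                   /* physical segment size     */
        uint32_t (*create)(size_t nbytes);       /* how objects are created   */
        void    *(*locate)(uint32_t object_id);  /* how an id is found in file */
        void     (*layout)(void *segment);       /* how objects fill a segment */
    } PoolPolicy;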
3 Indexing

The old version of INQUERY uses the traditional method of indexing a document collection and building the inverted file [HFBYL92]. This method is referred to as the alternative scheme in [STGM94]. The process involves multiple steps, diagrammed in Figure 1.

Figure 1: Traditional indexing process. (Document File -> 1. Parsing, Document Inversion -> Transaction File -> 2. Sorting -> Sorted Transaction File -> 3. Splitting -> Segments -> 4. Merging -> Merged Segments -> 5. Inverted List Accumulation -> Inverted Lists.)

The input to the process is a file of documents, which in INQUERY may currently be at most 256 Mbytes.[1] Collections larger than this limit must be broken up into multiple files. In step 1, document parsing assigns an identifier to each document and extracts the terms from the documents, stemming each term to its root and eliminating any stop words (words too frequent to be worth indexing). A surviving term must then be located or inserted in the term dictionary to obtain its term identifier and update the statistics stored there. After each document is parsed, it is inverted in main memory and a temporary file of transactions is generated for the document. A transaction consists of a term identifier, the document identifier, and the locations of each occurrence of that term in the document, representing the portion of the inverted list for the term that is contributed by the document.

[1] This is due to the current encoding method used for document file byte offsets. The largest number that can be encoded is 2^28.

To convert the transactions into inverted lists, all of the per-document inverted list portions for each term must be combined by sorting on the term identifier and document identifier. The transaction files may be over one-and-a-half times the size of the document files, however, so for large (multi-gigabyte) collections, it is impractical or impossible to combine the transactions for all of the documents into a single file. Instead, the transactions are stored in multiple files, one for each document file. The transaction file from each document file is sorted in step 2, and the sorted transaction file is split one or more times into head and tail segments in step 3. All of the transaction files are split at the same term t_k, such that the head segment for each transaction file will contain terms {t_1, ..., t_k} and the tail segment for each transaction file will contain terms {t_{k+1}, ..., t_n}, where there are n terms in the vocabulary. The corresponding segments for each transaction file can then be merged in step 4 to produce a fully sorted set of transactions for the entire document collection, broken up into multiple segments.

The final step is to build the inverted file. This is simply a matter of reading each inverted list from the sorted transactions, compressing the list, creating a Mneme object for the list, and storing the object identifier for the list in the term dictionary entry for the term. In step 5, the transaction segments are processed in order and the inverted lists are created. Obviously, steps 3 and 4 are required only when there are too many transactions from the document collection to handle in a single file. These steps may be eliminated when indexing smaller collections.
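The heart of steps 2 and 5 can be sketched in a few lines. The sketch below is our own simplification (occurrence locations and compression are omitted); it shows the sort key that makes each term's entries contiguous and ordered by document, and the run scan that turns them into inverted lists.

    #include <stdint.h>
    #include <stdlib.h>

    /* A transaction: the portion of one term's inverted list contributed
     * by one document (occurrence locations omitted for brevity). */
    typedef struct {
        uint32_t term_id;
        uint32_t doc_id;
        uint32_t weight;
    } Transaction;

    /* Sort key for step 2: term id first, then document id, so each
     * term's entries emerge contiguous and ordered by document. */
    static int txn_cmp(const void *a, const void *b) {
        const Transaction *x = a, *y = b;
        if (x->term_id != y->term_id)
            return x->term_id < y->term_id ? -1 : 1;
        return x->doc_id < y->doc_id ? -1 : (x->doc_id > y->doc_id);
    }

    /* Step 5: scan the sorted transactions; each run of equal term ids
     * is one complete inverted list. */
    void build_lists(Transaction *txns, size_t n) {
        qsort(txns, n, sizeof *txns, txn_cmp);
        for (size_t i = 0; i < n; ) {
            size_t j = i;
            while (j < n && txns[j].term_id == txns[i].term_id)
                j++;
            /* txns[i..j-1] is the inverted list for this term: compress
             * it, create a Mneme object for it, and record the object
             * identifier in the term dictionary entry. */
            i = j;
        }
    }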
Since the entries in each inverted list are sorted by document identifier, if new documents are always assigned increasing document identifiers, their inverted list entries may simply be appended to any existing inverted lists. Therefore, adding new documents to an existing collection does not necessarily require that the entire document collection be re-indexed. However, it does require the ability to grow inverted lists. Providing this functionality is non-trivial, such that many IR systems, the old INQUERY included, simply re-index the entire collection. Using the full functionality of Mneme, however, we can create a more sophisticated inverted file structure that will in fact support growing inverted lists.

To guide this design, we first consider some of the characteristics of inverted lists, all of which can be derived from the early observations of Zipf [Zip49]. Figure 2 shows the distribution of inverted list sizes for the TIPSTER volume 1 document collection used in our performance evaluation below (see Table 1). For a given inverted list size, the figure shows how many lists in the inverted file are less than or equal to that size, and how much those lists contribute to the total file size.

Figure 2: Cumulative distributions over inverted list size for TIPSTER volume 1, with 627072 lists and 420 Mb total. (The x-axis spans 10 bytes to 1 Mbyte on a log scale; the y-axis gives cumulative percentages.)

Our first observation is that over 90% of the inverted lists are less than 1000 bytes (in their compressed form), and account for less than 10% of the total inverted file size. Furthermore, nearly half of all lists are less than 16 bytes. This means that many inverted lists will never grow after their initial creation. Therefore, they should be allocated in a space efficient manner, i.e., reserving extra space for these lists in anticipation of growth would be a mistake. Also, it is well known that the vocabulary will continue to grow indefinitely [Hea78], so we must always be prepared to create more of these small inverted lists.

Most of the inverted file size is accounted for by a very small number of large inverted lists. These inverted lists will experience continuous, possibly vigorous growth. We have also observed that these large lists have a high probability of being accessed during query processing [BCCM94], so they must be allocated in a manner that affords efficient access.

The main extension we make to the old inverted file structure is this: instead of allocating each inverted list in a single object of the exact size, we allocate lists using a range of fixed size objects, where the sizes range from 16 to 8192 bytes by powers of 2 (i.e., 16, 32, 64, ..., 8192). When a new list is created, an object of the smallest size large enough to contain the list is allocated. A list can then grow to fill the object. When it exceeds the object, a new object of the next larger size is allocated, the contents of the old object are copied into the new object, and the old object is freed. When a list exceeds the largest object size (8192 bytes), rather than copy the list into an even larger object, we start a linked list of 8192 byte objects. Inverted list growth is then accomplished by appending to the tail object in the linked list, and adding a new object to the linked list when the tail is full. (A sketch of this policy appears at the end of this section.)

The largest objects are each allocated in their own physical segment and managed by a large object pool. The smallest (16 byte) objects are stored in 4 Kbyte physical segments, 255 objects per segment, and managed by a small object pool. The remaining objects are stored in 8 Kbyte physical segments, where each segment stores objects of only one size, and contains as many objects of that size as will fit. These objects are managed by a medium object pool. This scheme efficiently allocates the large number of relatively small inverted lists in the small and medium object pools, limits the number of times a list will be relocated due to growth, and most importantly, needs to access only the tail object when growing a large inverted list, leaving the majority of the data in the inverted file untouched during an update.

Now, the inverted file index can be built incrementally. To add a batch of documents to the index, we parse just the new batch, sort the generated transaction file to create inverted lists for the new documents, and then add the inverted lists to the existing inverted file, creating and growing inverted lists as described above. This corresponds to steps 1, 2, and 5 in Figure 1, with step 5 modified appropriately. The first two steps are performed independently of the existing index. They are also disk oriented operations, with small main memory requirements. Furthermore, they produce complete inverted lists for the new documents. During the final step an inverted list in the index will be updated only once, minimizing the number of costly relocations. Consequently, best performance is achieved when new documents are added to the index in the largest batches possible, reducing the overall number of updates.
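A minimal sketch of the sizing and growth policy described above, assuming the 16 to 8192 byte powers-of-two range; the Mneme allocate, copy, and free operations are only named in comments, not implemented.

    #include <stddef.h>

    #define MIN_OBJ 16
    #define MAX_OBJ 8192

    /* Smallest object size in the 16..8192 powers-of-two range that can
     * hold a new list of 'len' bytes (lists beyond MAX_OBJ are chained). */
    size_t object_size_for(size_t len) {
        size_t s = MIN_OBJ;
        while (s < len && s < MAX_OBJ)
            s *= 2;
        return s;
    }

    typedef enum { GROW_IN_PLACE, GROW_RELOCATE, GROW_CHAIN } GrowAction;

    /* How a list of 'len' bytes held in an object of 'objsize' bytes
     * grows when 'n' more bytes of postings arrive. */
    GrowAction grow_action(size_t len, size_t objsize, size_t n) {
        if (len + n <= objsize)
            return GROW_IN_PLACE;   /* room remains in the current object */
        if (len + n <= MAX_OBJ)
            return GROW_RELOCATE;   /* allocate a larger power-of-two object,
                                       copy the list, free the old object  */
        return GROW_CHAIN;          /* append to the tail of a linked list
                                       of MAX_OBJ-byte objects             */
    }

With ten sizes between 16 and 8192 bytes, a list can be relocated at most nine times before it reaches the chained regime, after which growth never moves existing data.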
4 Performance Evaluation

To evaluate our new inverted file structure, we measured incremental indexing speed, disk space requirements, and query processing speed using the resulting inverted file. For comparison, we also measured the same costs using traditional techniques. Below we describe our experimental platform, the test collection used, and the results of our measurements.

4.1 Platform

All of our experiments were run as superuser with logins disabled on an idle DECSystem 3000/400 (Alpha AXP CPU clocked at 133 MHz) running OSF/1 V1.3. The system was configured with 64 Mbytes of main memory and six 1.3 Gbyte RZ58 SCSI disks. The executables were compiled with the DEC C compiler driver 3.11 using optimization level 2. All of the data files and executables were stored on the local disks, and a 64 Mbyte "chill file" was read before each batch update or query processing run to purge the operating system file buffers and guarantee that no inverted file data was cached by the file system across runs. All times reported below are real (wall clock) time.

4.2 Test Collection

For our experiments, we used volume 1 of the TIPSTER document collection, a standard test collection in the IR community. Volume 1 is a 1.2 Gbyte collection of full-text articles and abstracts, divided into seven main files. Table 1 gives the relevant statistics for each of the files. The first three files contain Wall Street Journal articles from years 1987, 1988, and 1989 respectively, doe contains Department of Energy abstracts, ziff contains articles and abstracts from various periodicals, ap contains Associated Press articles, and fr contains Federal Register articles. For all indexing and query processing, we use stemming to reduce words to roots, and a stop words list to eliminate the frequent words not worth indexing.

Table 1: TIPSTER volume 1 file characteristics. Terms are new with respect to all files earlier in the table.

    File    Mb       Docs     Posts        Terms
    ...     ...      ...      ...          ...
    total   1197.0   510887   105932530    627072

4.3 Large Updates

Our first experiment treats each of the seven files as a separate batch update and incrementally indexes the volume one file at a time, in the order listed in Table 1. The results are plotted in Figure 3. For the two "per update" lines ("Old, per update" and "New, per update"), each point represents the time required to add the respective batch to the existing index (the first batch creates the initial index), where the points moving left to right correspond to the files in Table 1 moving top to bottom. The per update times are plotted against the total number of posts in the indexed collection after the update.

"Old" is the traditional scheme that re-indexes the entire collection each time a new batch is added.
The figure indicates that the time to add a batch is proportional to the size of the entire collection after the update. This scheme clearly does not scale well with the size of the collection. "New" is the new scheme which builds the inverted lists for an update batch separately and then adds them to the existing index, using the new inverted file structure. The time to add a batch is more proportional to the size of the batch, yielding incremental update times that are significantly better than in the old scheme.

Each point for the old scheme also gives the total time required to index a collection of the cumulative size in one batch (since the entire collection is re-indexed). The comparable times for the new scheme are the cumulative times for the incremental updates, shown as "New, cumulative". This plot shows that the total time to incrementally index the entire volume in the given batches with the new scheme is actually less than the time required by the old scheme to index the entire volume in a single batch. This rather surprising result is due to the elimination in the new scheme of a number of sequential passes over the temporary transaction files.

Figure 3: Per batch update times for files in TIPSTER volume 1, plotted versus the total number of postings in the indexed collection after the update.

Recall from Figure 1 that in order to handle a large document collection, the old scheme must split and merge transaction files in steps 3 and 4. The new scheme eliminates these steps, saving up to four sequential passes through the large temporary files. This represents a fairly significant constant factor and the difference in the slope of the "Old, per update" and "New, cumulative" lines. Note that a cumulative plot for the old scheme is not shown. Clearly, that plot would quickly shoot up off of the top of the chart in a super-linear fashion.

In Figure 4 we show the batch update time per posting for the two schemes. This is the time required to add the batch to the index divided by the number of postings in the batch. Here we can more clearly see that the update cost of the old scheme grows with the size of the existing index, while the update cost of the new scheme remains nearly constant. The large peak for the old scheme during the wsj89 update is due to the relatively small size of the update compared to the size of the existing collection. The slight dip for both schemes (in Figures 3 and 4) during the fr update occurs because the documents in fr are relatively large. Compared to a similar size batch with smaller documents, there are fewer document identifier/weight entries in the inverted lists for the same number of postings, and the per posting update cost is reduced.

Figure 4: Update time per posting for batch updates in TIPSTER volume 1.

4.4 Small Updates

The batches used above might be considered large for applications that support periodic updates to the document collection. Therefore, we also measured our new scheme on a range of small batch sizes. We used TIPSTER volume 1 for our existing indexed collection, and generated new document batches from volume 2 of the TIPSTER collection, which contains the same types of files as volume 1.
To build the batches, we selected documents from the different file types in round robin fashion until we had built seven batches ranging in size from 1 Mbyte to 64 Mbytes, by powers of 2. Statistics for the batches are given in Table 2.

Table 2: Characteristics of batches built from TIPSTER volume 2. Terms are new with respect to all batches earlier in the table, with TIPSTER volume 1 as the base.

    Batch   Mb    Docs    Posts       Terms
    ...     ...   ...     ...         ...
    total   127   27343   11371306    39524

Figure 5 shows the time required to incrementally index each of the batches and cumulatively add them to the existing index. The data points moving left to right correspond to batches 1 through 7 in Table 2, and are plotted against the cumulative postings from the batches. We also plot the time per posting for each batch update in Figure 6.

Figure 5: Per batch update times for range of batch sizes cumulatively added to TIPSTER volume 1 index, plotted versus the cumulative postings from the batches. Batches range in size from 1 Mbyte to 64 Mbytes by powers of 2.

Figure 6: Update time per posting for range of batch sizes cumulatively added to TIPSTER volume 1 index.

The figures show that as the size of the batch increases, the cost per posting decreases. This is due to the following factors. First, when building new inverted lists for a batch, enough (maybe all) of the term dictionary must be read in to identify all of the terms in the batch and make entries for new terms. Similarly, the dictionary must again be read in during the build phase when adding the new lists to the existing index. With larger batches, this cost is amortized over more postings. Second, larger batches will benefit from locality in the small and medium object pools. The inverted lists in these pools are clustered in segments, such that reading in a list to modify it causes all lists in the same segment to be read in. The cost of reading in the segment can then be amortized over any updates to other lists in the same segment that occur before the segment is flushed out.

In an effort to better accommodate small batch updates, we implemented an in-memory version of our system which performs updates on-line. Rather than build complete inverted lists for the batch off-line and add them to the index in a second step, the in-memory version builds the inverted lists for the batch in main memory as the documents in the batch are parsed. When the main memory inverted list buffer becomes full, the partial inverted lists for the batch are added to the existing index. The rationale behind this scheme is that we will eliminate redundant I/O caused by writing transactions, sorting off-line to build the inverted lists, and then reading the lists again to add them to the index. (A sketch of this buffering loop appears at the end of this subsection.)

To test this scheme, we allocated 10 Mbytes for the inverted list buffer (which accommodates around 800,000 compressed postings), 20 Mbytes of Mneme buffer space for the existing inverted index, and 24 Mbytes of Mneme buffer space for the term dictionary. We found, however, that for batch sizes up to 8 Mbytes the in-memory version was no faster than the off-line version, and much worse for larger batches. The poor performance on batches larger than 8 Mbytes is due to the inverted list buffer being flushed to the index multiple times per update. The surprisingly mediocre performance for batches that fit entirely in the inverted list buffer is attributed to insufficient main memory. Besides the buffers mentioned above, the on-line version requires additional space to parse the documents and build the inverted lists in the main memory buffer. We suspect that the system is paging excessively, although due to a bug in OSF/1 V1.3, we could not obtain process resource usage statistics to confirm this. We did observe, however, that running the in-memory version with a 20 Mbyte inverted list buffer on a DECSystem 3000/500 with 128 Mbytes of main memory gave much improved performance. Therefore, we suggest that this technique bears further investigation and may be appropriate for certain applications with adequate resources and small updates.
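A sketch of the buffering loop, assuming for simplicity that raw postings (rather than the compressed partial inverted lists the system actually keeps) are accumulated; the names and buffer bound are illustrative.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t term_id, doc_id, weight; } Posting;

    #define BUF_POSTINGS 800000   /* roughly the 10 Mbyte buffer above */

    static Posting buffer[BUF_POSTINGS];
    static size_t  nbuf;

    /* Adds/grows lists in the persistent index. Each call touches every
     * list with postings in the buffer, which is why flushing many times
     * per batch (batches larger than the buffer) performs poorly. */
    extern void flush_to_index(const Posting *p, size_t n);

    /* Called for each posting as documents in the batch are parsed. */
    void add_posting(uint32_t term, uint32_t doc, uint32_t weight) {
        if (nbuf == BUF_POSTINGS) {
            flush_to_index(buffer, nbuf);
            nbuf = 0;
        }
        buffer[nbuf++] = (Posting){ term, doc, weight };
    }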
4.5 Space Considerations

In addition to time, we must consider the disk space required by each indexing scheme. There are two aspects to disk space: permanent space consumed by the index files, and additional temporary space required for indexing.

To reduce the amount of permanent space consumed by the inverted file, the inverted lists are compressed in a two step process. First, the location information associated with each document entry for a given term is run-length encoded, where the first location is stored as an absolute value and all subsequent locations are stored as deltas from the previous location. This yields numbers of significantly smaller magnitude. Then, all numbers (document identifiers, weights, and encoded locations) are represented in base 2 using the minimum number of bytes (up to four), with a continuation bit reserved in each byte. This results in variable length numbers where the largest representable number is 2^28.
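The encoding just described can be sketched as follows. The exact bit conventions (which bit continues, the order of the seven-bit groups) are our assumptions; the paper fixes only the delta step and the four-byte, 2^28 limit.

    #include <stddef.h>
    #include <stdint.h>

    /* Variable-length encoding: seven data bits per byte plus a
     * continuation bit, at most four bytes, so values must be below
     * 2^28. Returns the number of bytes written. */
    size_t vbyte_encode(uint32_t v, uint8_t *out) {
        size_t n = 0;
        do {
            uint8_t b = v & 0x7F;        /* low seven bits of the value */
            v >>= 7;
            if (v != 0)
                b |= 0x80;               /* continuation: more bytes follow */
            out[n++] = b;
        } while (v != 0);
        return n;                        /* 1..4 bytes for v < 2^28 */
    }

    /* Delta-encode a sorted run of locations before vbyte-encoding them:
     * the first is absolute (prev = 0), the rest are gaps. */
    size_t encode_locations(const uint32_t *locs, size_t n, uint8_t *out) {
        size_t len = 0;
        uint32_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            len += vbyte_encode(locs[i] - prev, out + len);
            prev = locs[i];
        }
        return len;
    }

Decoding reverses the process, reassembling each number from its seven-bit groups and accumulating the deltas back into absolute locations.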
Compression yields the same space savings for both the old and new indexing schemes. However, the old scheme allocates objects exactly of the required size, while the new scheme allocates fixed size objects for the variable length lists. Therefore, we would expect a certain amount of internal fragmentation and an increase in inverted file size for the new scheme. The inverted file created in seven batches by the new scheme for TIPSTER volume 1 consumes 473 Mbytes, of which 420 Mbytes is inverted list data. Internal object fragmentation, or free space in objects that could be allocated in the future, accounts for 50 Mbytes, or about 12% of the size of the real data. The remaining 3 Mbytes is Mneme overhead, including auxiliary file structures and links between objects, a relatively insignificant 0.8% overhead. The total overhead is quite tolerable, and other results show that as the index becomes even larger, the overhead decreases slightly due to the full utilization of all but the tail objects in the linked lists for large inverted lists.

The difference in temporary disk space requirements between the two schemes during indexing is much more significant. The transaction files generated by the old scheme for TIPSTER volume 1 consume a total of 1.7 Gbytes, or nearly 50% more disk space than the raw document collection. Each time an update is made, the transaction files for the entire collection must be generated, resulting in temporary disk space costs proportional to the size of the total collection. The new scheme only requires enough space to generate and sort the transaction file for the new documents, yielding temporary disk space requirements proportional to the size of the update. Both schemes could benefit from using suitable compression techniques on the transaction files to reduce temporary disk space requirements, but again the old scheme will still require temporary disk space proportional to the size of the existing database.

4.6 Query Processing

The last performance issue is query processing, which we evaluate by comparing query processing speeds on the inverted files built in Section 4.3. Since the new scheme must assemble large inverted lists with multiple disk reads, we expect its query processing performance to suffer. However, the allocation strategy for large lists, combined with the small number of batch updates used to build the index, works to reduce this effect. Here is why. Large inverted lists allocated in a linked list of large objects are grown by adding new large objects to the tail of the list. New large objects are each created in their own physical segment, and new physical segments are allocated at the end of the file. Therefore, all objects added to a given linked list during a single batch update will be allocated sequentially at the end of the file. Since the index was built in seven batch updates, a large inverted list will consist of at most seven separate blocks of contiguous large objects, greatly reducing the disk seek costs to assemble the list.

To further explore this effect, we measured query processing performance on an index built using the on-line scheme described in Section 4.4 with a main memory inverted list buffer of 10 Mbytes. This produced an inverted file laid out as if it had been built with many small batch updates, where each batch contained approximately 800,000 postings.

The query set we used was generated locally from TIPSTER topics 51-100 using automatic and semi-automatic methods. We report the average of running the query set on each scheme six times, where each run differed from the average by less than 5%. The old scheme, with each list stored as a single contiguous object, required 975 seconds. The new scheme built on-line to simulate many small batches required 1256 seconds, or 28.8% longer than the old scheme. The new scheme built off-line with seven batch updates required 1033 seconds, or only 5.9% longer than the old scheme.

The results for the new index structure are surprisingly good. The improvement from the second to third scheme above indicates that query processing can be greatly improved by simple periodic reorganization of the inverted file to sequentially arrange the objects in linked lists. We expect that query processing can be further improved for the following reasons. First, the query processing model used in these experiments is "term-at-a-time", where the entire inverted list for a term is read and processed all at once. Many modern systems have adopted "document-at-a-time" processing [Wil84], which calculates the complete score for one document before proceeding to the next. In this model, the inverted lists are read in small chunks, a technique ideally suited to the linked list structure. Second, our inverted file structure might be combined with the query optimization techniques proposed by Wong and Lee [WL93] and Moffat and Zobel [MZ94b], who describe methods for eliminating processing on portions of inverted lists. Again, the linked list structure could be used to avoid I/O on these portions of the lists. Third, buffer management has not been tuned for the new file structure. Fourth, we have not considered any low level optimizations, such as overlapping I/O with processing, sophisticated pre-fetching schemes, or disk optimizations such as striping, all of which will reduce the time to assemble large inverted lists.
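For illustration, assembling a chained list looks roughly like this; fetch_object() is a hypothetical stand-in for a Mneme object read, and the layout fields are our own, not the actual object format.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One link in a chained large inverted list (illustrative layout). */
    typedef struct {
        uint32_t next_id;       /* object id of the next link; 0 ends it */
        uint32_t used;          /* bytes of list data present            */
        uint8_t  data[8192];    /* compressed postings                   */
    } ListObject;

    extern ListObject *fetch_object(uint32_t id);   /* hypothetical read */

    /* Term-at-a-time assembly: concatenate the whole chain into 'buf'.
     * A document-at-a-time evaluator would instead consume each link's
     * data as it is fetched, and could stop early. */
    size_t assemble_list(uint32_t head_id, uint8_t *buf, size_t cap) {
        size_t off = 0;
        for (uint32_t id = head_id; id != 0; ) {
            const ListObject *o = fetch_object(id);
            if (off + o->used > cap)
                break;                    /* caller's buffer too small */
            memcpy(buf + off, o->data, o->used);
            off += o->used;
            id = o->next_id;
        }
        return off;
    }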
5 Related Work

Efficient management of full-text database indexes has received a fair amount of attention. Faloutsos [Fal85] gives an early survey of the common indexing techniques. The two techniques that seem to predominate are signature files and inverted files. Since INQUERY uses an inverted file index, we do not discuss signature files.

Zobel et al. [ZMSD92] investigate the efficient implementation of an inverted file index for a full-text database system. Their focus is on compression techniques to limit the size of the inverted file index. These techniques could be usefully incorporated into our system. They also address updates to the inverted file using large fixed length disk blocks, where each block has a heap of inverted lists at the end of the block and a directory into the heap at the beginning of the block. As inverted lists grow they are rearranged in the heap or copied to other blocks with more space. Techniques for handling inverted lists larger than a disk block are not discussed, nor is the disk block technique fully evaluated. Our experience indicates that efficient management of large inverted lists is critical to performance, and we present experimental results demonstrating the effectiveness of our solution.

Tomasic et al. [TGMS94] propose a new data structure to support incremental indexing, and present a detailed simulation study over a variety of disk allocation schemes. The study is extended with a larger synthetic document collection in [STGM94], and a comparison is made with the traditional indexing technique. Their data structure manages small inverted lists in buckets (similar to the disk blocks in [ZMSD92]) and dynamically selects large inverted lists to be managed separately, not unlike our use of different object pools for different sized lists [BCCM94]. Their simulation results indicate that the best long list allocation scheme for update performance is to write the new portion of a long list in a new chunk at the end of the file. This is essentially what we do with our linked lists. However, they predict that query performance with this strategy will be poor. On the contrary, we have shown with an actual implementation that our linked list strategy can in fact provide good query performance, while simultaneously providing superior update performance. Moreover, their simulations assume that all buckets can fit in main memory during indexing, potentially requiring significant main memory resources. Our scheme makes no such assumption, requiring substantially less main memory.

Another scheme that handles large lists distinctly from small lists is proposed by Faloutsos and Jagadish [FJ92a]. In their scheme, small lists are stored as inverted lists, while large lists are stored as signature files. Again, we are primarily concerned with inverted lists and do not consider signature file solutions.

In [FJ92b], Faloutsos and Jagadish examine update and storage costs for a family of long inverted list implementations, where the general case is their "HYBRID" scheme. The HYBRID scheme essentially chains together chunks of the inverted list and provides a number of parameters to control the size of the chunks and the length of the chains.
At one extreme, limiting the length of a chain to one and allowing chunks to grow results in contiguous inverted lists, where relocation of the inverted list into a larger chunk is required when the current chunk is filled. At the other extreme, fixed size chunks and unlimited chain lengths give a standard linked list. Our overall scheme does not fit into this model since we initially grow chunks and then chain fixed size chunks. However, our small lists can be modeled as chains of length one where chunks are doubled in size during relocation, and our large lists can be modeled as unlimited length chains with fixed size chunks. Rather than argue analytically, we have shown experimentally that our scheme provides good update and search costs, with acceptable space overheads.

Cutting and Pedersen [CP90] investigate optimizations for dynamic update of inverted lists managed with a B-tree. For a speed optimization, they propose accumulating postings in a main memory postings buffer, and give both analytical and experimental results. It is difficult to make comparisons with their experimental results due to the size of the collections used. We present results for a collection nearly 100 times larger. We agree that updates should be batched, but our experience with an in-memory scheme indicates that as soon as the batch becomes too large to invert in main memory (i.e., partial inverted lists for the batch must be flushed to the index before the rest of the batch can be processed), an off-line scheme will provide better performance. Again, this requires further research. They also propose storing the smallest inverted lists directly in the B-tree index. An equivalent scheme using our hash term dictionary would be advantageous only if the dictionary did not increase in size. Increasing I/O and main memory costs for the term dictionary for the sake of rarely accessed inverted lists would be disastrous.

A number of other approaches to document indexing have been proposed. Fox and Lee [FL91] describe a technique that eliminates the sorting involved in indexing by making multiple passes over the input documents. Indexing is divided into loads, where a load is a contiguous chunk of the final inverted file. First, an initial pass over the input is made to determine the load boundaries. Then, a subsequent pass is made for each load, processing only terms that fall within the boundaries of the load and building the inverted lists for those terms in main memory. At the end of the pass, the inverted lists for the load can simply be appended to the inverted file.

Witten et al. [WMB94] present a variety of indexing algorithms, including an extended version of Fox and Lee's algorithm. They also enhance the traditional sort-based method with compression techniques and in-place and multiway merging to greatly improve efficiency in terms of main memory, disk space, and time. Results of applying some of these techniques to the TIPSTER document collection are presented in [MZ94a].

All of these other approaches were developed for large static document collections, and do not directly support incremental indexing. Some of the techniques, such as the sorting enhancements and making multiple passes through the input to "pre-allocate" the output, might be usefully incorporated into an incremental system. Their benefit, however, would be dependent on the size of the incremental batches, with larger batches deriving more benefit. A better integration might use one of the above algorithms the first time a large collection is indexed, and switch to an incremental technique thereafter.

Properly modeling the size distribution of inverted lists is addressed by Wolfram in [Wol92a, Wol92b]. He suggests that the informetric characteristics of document databases should be taken into consideration when designing the files used by an IR system. We follow this advice, as can be seen in Section 3, with what we consider to be very satisfactory results.
6 Conclusions

If IR systems are to satisfy the demand for applications that can manage an ever increasing repository of information, they must be able to efficiently add documents to large existing collections. The main bottleneck in that operation is updating the index structure used to manage the collection. The traditional solution to this problem is to re-index the entire collection, an operation with costs proportional to the size of the whole collection. This solution is clearly unacceptable.

We have proposed an alternative solution that yields costs proportional to the size of the update. Using the data management facilities of a persistent object store, we have designed a more sophisticated inverted file index that provides fast incremental updates. More importantly, we have implemented our scheme in an operational full-text information retrieval system and verified its performance empirically. The results we present show that our scheme maintains a nearly constant per posting update cost as the size of the collection grows, indicating excellent potential for scale. In fact, we have used our scheme to index the full 2 Gbyte TIPSTER collection in 13 batches and have found the trends described in Section 4 to hold. Our scheme requires considerably less disk space during indexing than traditional techniques, and allows much of the processing for a new batch of documents to be done independently from the existing index. This last point is particularly important for the eventual support of simultaneous query processing and collection updating, since the period of time during which index structures must be locked for updating can be minimized.

We found that best performance is achieved when documents are added in the largest batches possible, both in terms of incremental indexing time and resultant query processing speed. We have also shown that our scheme provides a good level of performance with small batch updates, and have suggested techniques to improve both small batch update and query processing performance. These techniques bear further investigation and represent future work.

Finally, we have achieved these results using "off-the-shelf" data management technology, continuing to show that the data management facilities in IR systems need not be custom built to achieve high performance.

Acknowledgements

We gratefully acknowledge Eliot Moss and the anonymous referees for their comments and suggestions for improvements.

References

[BCCM94] Eric W. Brown, James P. Callan, W. Bruce Croft, and J. Eliot B. Moss. Supporting full-text information retrieval with a persistent object store. In Proc. of the 4th Inter. Conf. on Extending Database Technology, pages 365-378, March 1994.

[CCH92] James P. Callan, W. Bruce Croft, and Stephen M. Harding. The INQUERY retrieval system. In Proc. of the 3rd Inter. Conf. on Database and Expert Sys. Apps., September 1992.

[CP90] Doug Cutting and Jan Pedersen. Optimizations for dynamic inverted index maintenance. In Proc. of the 13th Inter. ACM SIGIR Conf. on Res. and Develop. in Infor. Retr., pages 405-411, 1990.

[Fal85] Christos Faloutsos. Access methods for text. ACM Computing Surveys, 17(1):49-74, 1985.

[FJ92a] Christos Faloutsos and H. V. Jagadish. Hybrid index organizations for text databases. In Proc. of the 3rd Inter. Conf. on Extending Database Technology, pages 310-327, 1992.

[FJ92b] Christos Faloutsos and H. V. Jagadish. On B-tree indices for skewed distributions. In Proc. of the 18th Inter. Conf. on VLDB, pages 363-374, Vancouver, 1992.
[FL91] Edward A. Fox and Whay C. Lee. FAST-INV: A fast algorithm for building large inverted files. Technical Report TR-91-10, VPI&SU Department of Computer Science, Blacksburg, VA, March 1991.

[Har94] Donna Harman, editor. The Second Text REtrieval Conference (TREC-2). National Institute of Standards and Technology Special Publication, Gaithersburg, MD, 1994.

[Hea78] H. S. Heaps. Information Retrieval: Computational and Theoretical Aspects. Academic Press, Inc., New York, 1978.

[HFBYL92] Donna Harman, Edward Fox, Ricardo Baeza-Yates, and Whay Lee. Inverted files. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, chapter 3, pages 28-43. Prentice Hall, Englewood Cliffs, NJ, 1992.

[Mos90] J. Eliot B. Moss. Design of the Mneme persistent object store. ACM Trans. Inf. Syst., 8(2):103-139, April 1990.

[MZ94a] Alistair Moffat and Justin Zobel. Compression and fast indexing for multi-gigabyte text databases. Australian Comput. J., 26(1):1-9, February 1994.

[MZ94b] Alistair Moffat and Justin Zobel. Self-indexing inverted files for fast text retrieval. Technical Report 94/2, Collaborative Information Technology Research Institute, Department of Computer Science, Royal Melbourne Institute of Technology, Australia, February 1994.

[SM83] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[STGM94] Kurt Shoens, Anthony Tomasic, and Hector Garcia-Molina. Synthetic workload performance analysis of incremental updates. In Proc. of the 17th Inter. ACM SIGIR Conf. on Res. and Develop. in Infor. Retr., Dublin, July 1994.

[TC91] Howard Turtle and W. Bruce Croft. Evaluation of an inference network-based retrieval model. ACM Trans. Inf. Syst., 9(3):187-222, July 1991.

[TC92] Howard R. Turtle and W. Bruce Croft. A comparison of text retrieval models. Comput. J., 35(3):279-290, 1992.

[TGMS94] Anthony Tomasic, Hector Garcia-Molina, and Kurt Shoens. Incremental updates of inverted lists for text document retrieval. In Proc. of the ACM SIGMOD Inter. Conf. on Management of Data, Minneapolis, MN, May 1994.

[Wil84] Peter Willett. A nearest neighbour search algorithm for bibliographic retrieval from multilist files. Inform. Tech., 3(2):78-83, April 1984.

[WL93] Wai Yee Peter Wong and Dik Lun Lee. Implementations of partial document ranking using inverted files. Inf. Process. & Mgmnt., 29(5):647-669, 1993.

[WMB94] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York, 1994.

[Wol92a] Dietmar Wolfram. Applying informetric characteristics of databases to IR system file design, Part I: informetric models. Inf. Process. & Mgmnt., 28(1):121-133, 1992.

[Wol92b] Dietmar Wolfram. Applying informetric characteristics of databases to IR system file design, Part II: simulation comparisons. Inf. Process. & Mgmnt., 28(1):135-151, 1992.

[Zip49] George Kingsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, 1949.

[ZMSD92] Justin Zobel, Alistair Moffat, and Ron Sacks-Davis. An efficient indexing technique for full-text database systems. In Proc. of the 18th Inter. Conf. on VLDB, pages 352-362, Vancouver, 1992.