Fast Incremental Indexing for Full-Text Information Retrieval*

Eric W. Brown (brown@cs.umass.edu), James P. Callan (callan@cs.umass.edu), W. Bruce Croft (croft@cs.umass.edu)
Department of Computer Science, University of Massachusetts, Amherst, MA 01003 USA

Abstract

Full-text information retrieval systems have traditionally been designed for archival environments. They often provide little or no support for adding new documents to an existing document collection, requiring instead that the entire collection be re-indexed. Modern applications, such as information filtering, operate in dynamic environments that require frequent additions to document collections. We provide this ability using a traditional inverted file index built on top of a persistent object store. The data management facilities of the persistent object store are used to produce efficient incremental update of the inverted lists. We describe our system and present experimental results showing superior incremental indexing and competitive query processing performance.

Keywords: full-text document retrieval, incremental indexing, persistent object store, performance

1 Introduction

Full-text information retrieval (IR) systems are well established tools for satisfying a user's information need when the information base is a relatively static collection of documents. However, modern information management systems must be able to handle a steady influx of new information. Applications such as information filtering and daily news feed services are constantly processing new documents. If IR systems are to support such applications, they must be able to manage continually growing document collections.

A prerequisite to supporting a growing document collection is the ability to update the data structures used to index the collection. An indexing structure used by many IR systems is the inverted file index [SM83, Fal85, HFBYL92]. An inverted file index consists of a record, or inverted list, for each term that appears in the document collection. A term's inverted list stores a document identifier and weight for every document in which the term appears. The weight might simply be the number of times the term appears in the document, or a more sophisticated measure of the significance of the term's appearance in the document. Additionally, the location of each occurrence of the term in the document may be stored in order to support queries based on the relative positions of terms within documents.

When a batch of new documents is added to an existing document collection, a small number of the terms in the batch will be new to the collection, while the majority of the terms in the batch will already have inverted lists in the index. These lists must be updated by appending to them the term occurrences found in the new documents. This task is made difficult by the size characteristics of inverted lists and the techniques used to manage them in traditional IR systems. The inverted lists for a multi-gigabyte document collection will range in size from a few bytes to millions of bytes, and they are typically laid out contiguously in a flat inverted file with no gaps between the lists.

* This work is supported by the National Science Foundation Center for Intelligent Information Retrieval at the University of Massachusetts.
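To make the preceding description concrete, the sketch below shows one plausible, uncompressed in-memory layout for an inverted list. It is illustrative only; the names are our own, and INQUERY's actual lists are stored compressed (see Section 4.5).

    /* Illustrative, uncompressed layout of an inverted list. Field names
     * are hypothetical; the on-disk format is compressed (Section 4.5). */
    #include <stdint.h>

    typedef struct {
        uint32_t  doc_id;     /* document in which the term appears       */
        uint32_t  weight;     /* e.g., occurrence count in the document   */
        uint32_t  num_locs;   /* number of recorded occurrence positions  */
        uint32_t *locations;  /* position of each occurrence              */
    } PostingEntry;

    typedef struct {
        uint32_t      term_id;
        uint32_t      num_entries;  /* one entry per document, sorted by doc_id */
        PostingEntry *entries;
    } InvertedList;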
Adding to inverted lists stored in such a fashion requires expensive relocation of growing lists and careful management of free space in the inverted file. Rather than update existing inverted lists when adding new documents, many IR systems simply rebuild the inverted file by adding the new documents to the existing collection and indexing the entire collection from scratch. This technique is expensive in terms of time and disk space, resulting in update costs proportional to the size of the total collection after the addition. Instead, we would prefer the cost of the update to be proportional to the size of the new documents being added.

The INQUERY full-text information retrieval system [TC91, CCH92] provides this desirable update performance using the Mneme persistent object store [Mos90] to manage its inverted file index [BCCM94]. The key to providing fast incremental indexing is a unique inverted file structure made possible by the data management facilities of the persistent object store. Inverted lists are allocated in fixed size objects with a finite range of sizes, limiting the number of relocations a growing list will experience. When an inverted list has exceeded the largest object size, additional large objects are allocated and chained together in a linked list. Experimental results show that our system is able to provide superior incremental indexing performance in terms of both time and space, with only a small impact on query processing.

In the next section we briefly describe the main architectural features of INQUERY and Mneme. In Section 3 we discuss our inverted file structure in detail, including the motivation for the design. We present experimental results in Section 4 comparing the performance of our system with traditional techniques. In Section 5 we discuss related work, and we offer concluding remarks in Section 6. Contributions of our work include a technique for supporting incremental update of inverted lists in full-text information retrieval systems, along with an empirical analysis of a working implementation. Also, we continue to demonstrate that data management facilities for IR systems need not be custom built to obtain superior performance. Rather, IR systems can be effectively supported using appropriate "off-the-shelf" data management software.

2 Architecture

In this section we highlight some of the basic features of INQUERY and Mneme that are relevant to this work. Throughout the paper we will refer to the system as it is described here as the "old" version.

2.1 INQUERY

INQUERY is a probabilistic information retrieval system based upon a Bayesian inference network model [TC91, CCH92]. The power of the inference network model is the consistent formalism it provides for reasoning about evidence of differing types, allowing multiple retrieval models, document representations, and query representations to be combined simultaneously. Extensive testing has shown INQUERY to be one of the best IR systems, as measured by the standard IR metrics of recall and precision [Har94, TC92]. INQUERY is fast, scales well to large document collections, and can be embedded in specialized applications.

In INQUERY, document retrieval is accomplished by combining evidence from the document collection with evidence from the query to produce a ranking of the documents in the collection. The evidence for a document collection is pre-computed and stored as the weights and locations in an inverted file index. During retrieval, the inverted list for each term in the query is accessed and the evidence in the lists is accumulated and combined as dictated by the operators in the query. The inverted list for a term is obtained by looking up the term in the term dictionary. The term dictionary is built as a hash table with open-chaining conflict resolution. A term's entry in the dictionary contains collection statistics for the term and a reference to the term's inverted list. Inverted lists are stored as Mneme objects, where a single object of the exact size is allocated for each inverted list.
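The dictionary organization just described can be pictured as follows. This is a minimal sketch: the field names, hash function, and lookup routine are our assumptions, not INQUERY's actual code.

    /* Minimal sketch of the term dictionary: a hash table with
     * open-chaining conflict resolution. All names are assumed. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct DictEntry {
        char             *term;         /* stemmed term string (hash key)     */
        uint32_t          term_id;
        uint32_t          doc_count;    /* collection statistics for the term */
        uint32_t          coll_freq;
        uint32_t          list_obj_id;  /* reference to the inverted list object */
        struct DictEntry *next;         /* open chaining within a bucket      */
    } DictEntry;

    DictEntry *dict_lookup(DictEntry **buckets, size_t nbuckets,
                           const char *term) {
        size_t h = 0;
        for (const char *p = term; *p; p++)     /* any string hash would do */
            h = h * 31 + (unsigned char)*p;
        for (DictEntry *e = buckets[h % nbuckets]; e != NULL; e = e->next)
            if (strcmp(e->term, term) == 0)
                return e;
        return NULL;    /* caller inserts a new entry for a new term */
    }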
2.2 Mneme

The Mneme persistent object store [Mos90] was designed to be efficient and extensible. The basic services provided by Mneme are storage and retrieval of objects, where an object is a chunk of contiguous bytes that has been assigned a unique identifier. Mneme has no notion of type or class for objects. The only structure Mneme is aware of is that objects may contain the identifiers of other objects, resulting in inter-object references.

Objects are grouped into files supported by the operating system. Within files, they are physically grouped into physical segments. A physical segment is the unit of transfer between disk and main memory and is of arbitrary size. Objects are also logically grouped into pools, where a pool defines a number of management policies for the objects contained in the pool, such as how large the physical segments are, how the objects are laid out in a physical segment, how objects are located within a file, and how objects are created. Object format is determined by the pool, allowing objects to be stored in the format required by the application that uses the objects (modulo any translation that may be required for persistent storage, such as conversion of main memory pointers to object identifiers). Pools provide the primary extensibility mechanism in Mneme. By implementing new pool routines, the system can be significantly customized. The base system provides a number of fundamental mechanisms and tools for building pool routines, including a suite of standard pool routines for file and auxiliary table management.

Support for sophisticated buffer management is provided by an extensible buffering mechanism. Buffers may be defined by supplying a number of standard buffer operations (e.g., allocate and free) in a system defined format. How these operations are implemented determines the policies used to manage the buffer.
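One way to picture a pool is as a bundle of policy routines selected per group of objects. The following sketch is our own illustration of that idea and does not reproduce Mneme's actual interface.

    /* Illustrative rendering of a pool as a set of policy routines.
     * This is not Mneme's real API, only a picture of the concept. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        size_t   segment_size;                   /* physical segment size     */
        uint32_t (*create)(size_t nbytes);       /* how objects are created   */
        void    *(*locate)(uint32_t object_id);  /* how an id is found in file */
        void     (*layout)(void *segment);       /* how objects fill a segment */
    } PoolPolicy;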
3 Indexing

The old version of INQUERY uses the traditional method of indexing a document collection and building the inverted file [HFBYL92]. This method is referred to as the alternative scheme in [STGM94]. The process involves multiple steps, diagrammed in Figure 1.

Figure 1: Traditional indexing process. (Document File -> 1. Parsing, Document Inversion -> Transaction File -> 2. Sorting -> Sorted Transaction File -> 3. Splitting -> Segments -> 4. Merging -> Merged Segments -> 5. Inverted List Accumulation -> Inverted Lists.)

The input to the process is a file of documents, which in INQUERY may currently be at most 256 Mbytes.[1] Collections larger than this limit must be broken up into multiple files. In step 1, document parsing assigns an identifier to each document and extracts the terms from the documents, stemming each term to its root and eliminating any stop words (words too frequent to be worth indexing). A surviving term must then be located or inserted in the term dictionary to obtain its term identifier and update the statistics stored there. After each document is parsed, it is inverted in main memory and a temporary file of transactions is generated for the document. A transaction consists of a term identifier, the document identifier, and the locations of each occurrence of that term in the document, representing the portion of the inverted list for the term that is contributed by the document.

[1] This is due to the current encoding method used for document file byte offsets. The largest number that can be encoded is 2^28.

To convert the transactions into inverted lists, all of the per-document inverted list portions for each term must be combined by sorting on the term identifier and document identifier. The transaction files may be over one-and-a-half times the size of the document files, however, so for large (multi-gigabyte) collections, it is impractical or impossible to combine the transactions for all of the documents into a single file. Instead, the transactions are stored in multiple files, one for each document file. The transaction file from each document file is sorted in step 2, and the sorted transaction file is split one or more times into head and tail segments in step 3. All of the transaction files are split at the same term t_k, such that the head segment for each transaction file will contain terms {t_1, ..., t_k} and the tail segment for each transaction file will contain terms {t_{k+1}, ..., t_n}, where there are n terms in the vocabulary. The corresponding segments for each transaction file can then be merged in step 4 to produce a fully sorted set of transactions for the entire document collection, broken up into multiple segments.

The final step is to build the inverted file. This is simply a matter of reading each inverted list from the sorted transactions, compressing the list, creating a Mneme object for the list, and storing the object identifier for the list in the term dictionary entry for the term. In step 5, the transaction segments are processed in order and the inverted lists are created. Obviously, steps 3 and 4 are required only when there are too many transactions from the document collection to handle in a single file. These steps may be eliminated when indexing smaller collections.
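The heart of steps 2 and 5 can be sketched in a few lines. The sketch below is our own simplification (occurrence locations and compression are omitted); it shows the sort key that makes each term's entries contiguous and ordered by document, and the run scan that turns them into inverted lists.

    #include <stdint.h>
    #include <stdlib.h>

    /* A transaction: the portion of one term's inverted list contributed
     * by one document (occurrence locations omitted for brevity). */
    typedef struct {
        uint32_t term_id;
        uint32_t doc_id;
        uint32_t weight;
    } Transaction;

    /* Sort key for step 2: term id first, then document id, so each
     * term's entries emerge contiguous and ordered by document. */
    static int txn_cmp(const void *a, const void *b) {
        const Transaction *x = a, *y = b;
        if (x->term_id != y->term_id)
            return x->term_id < y->term_id ? -1 : 1;
        return x->doc_id < y->doc_id ? -1 : (x->doc_id > y->doc_id);
    }

    /* Step 5: scan the sorted transactions; each run of equal term ids
     * is one complete inverted list. */
    void build_lists(Transaction *txns, size_t n) {
        qsort(txns, n, sizeof *txns, txn_cmp);
        for (size_t i = 0; i < n; ) {
            size_t j = i;
            while (j < n && txns[j].term_id == txns[i].term_id)
                j++;
            /* txns[i..j-1] is the inverted list for this term: compress
             * it, create a Mneme object for it, and record the object
             * identifier in the term dictionary entry. */
            i = j;
        }
    }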
Since the entries in each inverted list are sorted by document identifier, if new documents are always assigned increasing document identifiers, their inverted list entries may simply be appended to any existing inverted lists. Therefore, adding new documents to an existing collection does not necessarily require that the entire document collection be re-indexed. However, it does require the ability to grow inverted lists. Providing this functionality is non-trivial, such that many IR systems, the old INQUERY included, simply re-index the entire collection. Using the full functionality of Mneme, however, we can create a more sophisticated inverted file structure that will in fact support growing inverted lists.

To guide this design, we first consider some of the characteristics of inverted lists, all of which can be derived from the early observations of Zipf [Zip49]. Figure 2 shows the distribution of inverted list sizes for the TIPSTER volume 1 document collection used in our performance evaluation below (see Table 1). For a given inverted list size, the figure shows how many lists in the inverted file are less than or equal to that size, and how much those lists contribute to the total file size.

Figure 2: Cumulative distributions over inverted list size for TIPSTER volume 1, with 627072 lists and 420 Mb total. (The x-axis spans 10 bytes to 1 Mbyte on a log scale; the y-axis gives cumulative percentages.)

Our first observation is that over 90% of the inverted lists are less than 1000 bytes (in their compressed form), and account for less than 10% of the total inverted file size. Furthermore, nearly half of all lists are less than 16 bytes. This means that many inverted lists will never grow after their initial creation. Therefore, they should be allocated in a space efficient manner, i.e., reserving extra space for these lists in anticipation of growth would be a mistake. Also, it is well known that the vocabulary will continue to grow indefinitely [Hea78], so we must always be prepared to create more of these small inverted lists.

Most of the inverted file size is accounted for by a very small number of large inverted lists. These inverted lists will experience continuous, possibly vigorous growth. We have also observed that these large lists have a high probability of being accessed during query processing [BCCM94], so they must be allocated in a manner that affords efficient access.

The main extension we make to the old inverted file structure is this: instead of allocating each inverted list in a single object of the exact size, we allocate lists using a range of fixed size objects, where the sizes range from 16 to 8192 bytes by powers of 2 (i.e., 16, 32, 64, ..., 8192). When a new list is created, an object of the smallest size large enough to contain the list is allocated. A list can then grow to fill the object. When it exceeds the object, a new object of the next larger size is allocated, the contents of the old object are copied into the new object, and the old object is freed. When a list exceeds the largest object size (8192 bytes), rather than copy the list into an even larger object, we start a linked list of 8192 byte objects. Inverted list growth is then accomplished by appending to the tail object in the linked list, and adding a new object to the linked list when the tail is full. (A sketch of this policy appears at the end of this section.)

The largest objects are each allocated in their own physical segment and managed by a large object pool. The smallest (16 byte) objects are stored in 4 Kbyte physical segments, 255 objects per segment, and managed by a small object pool. The remaining objects are stored in 8 Kbyte physical segments, where each segment stores objects of only one size, and contains as many objects of that size as will fit. These objects are managed by a medium object pool. This scheme efficiently allocates the large number of relatively small inverted lists in the small and medium object pools, limits the number of times a list will be relocated due to growth, and most importantly, needs to access only the tail object when growing a large inverted list, leaving the majority of the data in the inverted file untouched during an update.

Now, the inverted file index can be built incrementally. To add a batch of documents to the index, we parse just the new batch, sort the generated transaction file to create inverted lists for the new documents, and then add the inverted lists to the existing inverted file, creating and growing inverted lists as described above. This corresponds to steps 1, 2, and 5 in Figure 1, with step 5 modified appropriately. The first two steps are performed independently of the existing index. They are also disk oriented operations, with small main memory requirements. Furthermore, they produce complete inverted lists for the new documents. During the final step an inverted list in the index will be updated only once, minimizing the number of costly relocations. Consequently, best performance is achieved when new documents are added to the index in the largest batches possible, reducing the overall number of updates.
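A minimal sketch of the sizing and growth policy described above, assuming the 16 to 8192 byte powers-of-two range; the Mneme allocate, copy, and free operations are only named in comments, not implemented.

    #include <stddef.h>

    #define MIN_OBJ 16
    #define MAX_OBJ 8192

    /* Smallest object size in the 16..8192 powers-of-two range that can
     * hold a new list of 'len' bytes (lists beyond MAX_OBJ are chained). */
    size_t object_size_for(size_t len) {
        size_t s = MIN_OBJ;
        while (s < len && s < MAX_OBJ)
            s *= 2;
        return s;
    }

    typedef enum { GROW_IN_PLACE, GROW_RELOCATE, GROW_CHAIN } GrowAction;

    /* How a list of 'len' bytes held in an object of 'objsize' bytes
     * grows when 'n' more bytes of postings arrive. */
    GrowAction grow_action(size_t len, size_t objsize, size_t n) {
        if (len + n <= objsize)
            return GROW_IN_PLACE;   /* room remains in the current object */
        if (len + n <= MAX_OBJ)
            return GROW_RELOCATE;   /* allocate a larger power-of-two object,
                                       copy the list, free the old object  */
        return GROW_CHAIN;          /* append to the tail of a linked list
                                       of MAX_OBJ-byte objects             */
    }

With ten sizes between 16 and 8192 bytes, a list can be relocated at most nine times before it reaches the chained regime, after which growth never moves existing data.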
4 Performance Evaluation

To evaluate our new inverted file structure, we measured incremental indexing speed, disk space requirements, and query processing speed using the resulting inverted file. For comparison, we also measured the same costs using traditional techniques. Below we describe our experimental platform, the test collection used, and the results of our measurements.

4.1 Platform

All of our experiments were run as superuser with logins disabled on an idle DECSystem 3000/400 (Alpha AXP CPU clocked at 133 MHz) running OSF/1 V1.3. The system was configured with 64 Mbytes of main memory and six 1.3 Gbyte RZ58 SCSI disks. The executables were compiled with the DEC C compiler driver 3.11 using optimization level 2. All of the data files and executables were stored on the local disks, and a 64 Mbyte "chill file" was read before each batch update or query processing run to purge the operating system file buffers and guarantee that no inverted file data was cached by the file system across runs. All times reported below are real (wall clock) time.

4.2 Test Collection

For our experiments, we used volume 1 of the TIPSTER document collection, a standard test collection in the IR community. Volume 1 is a 1.2 Gbyte collection of full-text articles and abstracts, divided into seven main files. Table 1 gives the relevant statistics for each of the files. The first three files contain Wall Street Journal articles from years 1987, 1988, and 1989 respectively, doe contains Department of Energy abstracts, ziff contains articles and abstracts from various periodicals, ap contains Associated Press articles, and fr contains Federal Register articles. For all indexing and query processing, we use stemming to reduce words to roots, and a stop words list to eliminate the frequent words not worth indexing.

Table 1: TIPSTER volume 1 file characteristics. Terms are new with respect to all files earlier in the table.

    File    Mb       Docs     Posts        Terms
    ...     ...      ...      ...          ...
    total   1197.0   510887   105932530    627072

4.3 Large Updates

Our first experiment treats each of the seven files as a separate batch update and incrementally indexes the volume one file at a time, in the order listed in Table 1. The results are plotted in Figure 3. For the two "per update" lines ("Old, per update" and "New, per update"), each point represents the time required to add the respective batch to the existing index (the first batch creates the initial index), where the points moving left to right correspond to the files in Table 1 moving top to bottom. The per update times are plotted against the total number of posts in the indexed collection after the update.

"Old" is the traditional scheme that re-indexes the entire collection each time a new batch is added.
The figure indicates that the time to add a batch is proportional to the size of the entire collection after the update. This scheme clearly does not scale well with the size of the collection. "New" is the new scheme which builds the inverted lists for an update batch separately and then adds them to the existing index, using the new inverted file structure. The time to add a batch is more proportional to the size of the batch, yielding incremental update times that are significantly better than in the old scheme.

Each point for the old scheme also gives the total time required to index a collection of the cumulative size in one batch (since the entire collection is re-indexed). The comparable times for the new scheme are the cumulative times for the incremental updates, shown as "New, cumulative". This plot shows that the total time to incrementally index the entire volume in the given batches with the new scheme is actually less than the time required by the old scheme to index the entire volume in a single batch. This rather surprising result is due to the elimination in the new scheme of a number of sequential passes over the temporary transaction files.

Figure 3: Per batch update times for files in TIPSTER volume 1, plotted versus the total number of postings in the indexed collection after the update.

Recall from Figure 1 that in order to handle a large document collection, the old scheme must split and merge transaction files in steps 3 and 4. The new scheme eliminates these steps, saving up to four sequential passes through the large temporary files. This represents a fairly significant constant factor and the difference in the slope of the "Old, per update" and "New, cumulative" lines. Note that a cumulative plot for the old scheme is not shown. Clearly, that plot would quickly shoot up off of the top of the chart in a super-linear fashion.

In Figure 4 we show the batch update time per posting for the two schemes. This is the time required to add the batch to the index divided by the number of postings in the batch. Here we can more clearly see that the update cost of the old scheme grows with the size of the existing index, while the update cost of the new scheme remains nearly constant. The large peak for the old scheme during the wsj89 update is due to the relatively small size of the update compared to the size of the existing collection. The slight dip for both schemes (in Figures 3 and 4) during the fr update occurs because the documents in fr are relatively large. Compared to a similar size batch with smaller documents, there are fewer document identifier/weight entries in the inverted lists for the same number of postings, and the per posting update cost is reduced.

Figure 4: Update time per posting for batch updates in TIPSTER volume 1.

4.4 Small Updates

The batches used above might be considered large for applications that support periodic updates to the document collection. Therefore, we also measured our new scheme on a range of small batch sizes. We used TIPSTER volume 1 for our existing indexed collection, and generated new document batches from volume 2 of the TIPSTER collection, which contains the same types of files as volume 1.
To build the batches, we selected documents from the different file types in round robin fashion until we had built seven batches ranging in size from 1 Mbyte to 64 Mbytes, by powers of 2. Statistics for the batches are given in Table 2.

Table 2: Characteristics of batches built from TIPSTER volume 2. Terms are new with respect to all batches earlier in the table, with TIPSTER volume 1 as the base.

    Batch   Mb    Docs    Posts       Terms
    ...     ...   ...     ...         ...
    total   127   27343   11371306    39524

Figure 5 shows the time required to incrementally index each of the batches and cumulatively add them to the existing index. The data points moving left to right correspond to batches 1 through 7 in Table 2, and are plotted against the cumulative postings from the batches. We also plot the time per posting for each batch update in Figure 6.

Figure 5: Per batch update times for range of batch sizes cumulatively added to TIPSTER volume 1 index, plotted versus the cumulative postings from the batches. Batches range in size from 1 Mbyte to 64 Mbytes by powers of 2.

Figure 6: Update time per posting for range of batch sizes cumulatively added to TIPSTER volume 1 index.

The figures show that as the size of the batch increases, the cost per posting decreases. This is due to the following factors. First, when building new inverted lists for a batch, enough (maybe all) of the term dictionary must be read in to identify all of the terms in the batch and make entries for new terms. Similarly, the dictionary must again be read in during the build phase when adding the new lists to the existing index. With larger batches, this cost is amortized over more postings. Second, larger batches will benefit from locality in the small and medium object pools. The inverted lists in these pools are clustered in segments, such that reading in a list to modify it causes all lists in the same segment to be read in. The cost of reading in the segment can then be amortized over any updates to other lists in the same segment that occur before the segment is flushed out.

In an effort to better accommodate small batch updates, we implemented an in-memory version of our system which performs updates on-line. Rather than build complete inverted lists for the batch off-line and add them to the index in a second step, the in-memory version builds the inverted lists for the batch in main memory as the documents in the batch are parsed. When the main memory inverted list buffer becomes full, the partial inverted lists for the batch are added to the existing index. The rationale behind this scheme is that we will eliminate redundant I/O caused by writing transactions, sorting off-line to build the inverted lists, and then reading the lists again to add them to the index. (A sketch of this buffering loop appears at the end of this subsection.)

To test this scheme, we allocated 10 Mbytes for the inverted list buffer (which accommodates around 800,000 compressed postings), 20 Mbytes of Mneme buffer space for the existing inverted index, and 24 Mbytes of Mneme buffer space for the term dictionary. We found, however, that for batch sizes up to 8 Mbytes the in-memory version was no faster than the off-line version, and much worse for larger batches. The poor performance on batches larger than 8 Mbytes is due to the inverted list buffer being flushed to the index multiple times per update. The surprisingly mediocre performance for batches that fit entirely in the inverted list buffer is attributed to insufficient main memory. Besides the buffers mentioned above, the on-line version requires additional space to parse the documents and build the inverted lists in the main memory buffer. We suspect that the system is paging excessively, although due to a bug in OSF/1 V1.3, we could not obtain process resource usage statistics to confirm this. We did observe, however, that running the in-memory version with a 20 Mbyte inverted list buffer on a DECSystem 3000/500 with 128 Mbytes of main memory gave much improved performance. Therefore, we suggest that this technique bears further investigation and may be appropriate for certain applications with adequate resources and small updates.
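A sketch of the buffering loop, assuming for simplicity that raw postings (rather than the compressed partial inverted lists the system actually keeps) are accumulated; the names and buffer bound are illustrative.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t term_id, doc_id, weight; } Posting;

    #define BUF_POSTINGS 800000   /* roughly the 10 Mbyte buffer above */

    static Posting buffer[BUF_POSTINGS];
    static size_t  nbuf;

    /* Adds/grows lists in the persistent index. Each call touches every
     * list with postings in the buffer, which is why flushing many times
     * per batch (batches larger than the buffer) performs poorly. */
    extern void flush_to_index(const Posting *p, size_t n);

    /* Called for each posting as documents in the batch are parsed. */
    void add_posting(uint32_t term, uint32_t doc, uint32_t weight) {
        if (nbuf == BUF_POSTINGS) {
            flush_to_index(buffer, nbuf);
            nbuf = 0;
        }
        buffer[nbuf++] = (Posting){ term, doc, weight };
    }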
4.5 Space Considerations

In addition to time, we must consider the disk space required by each indexing scheme. There are two aspects to disk space: permanent space consumed by the index files, and additional temporary space required for indexing.

To reduce the amount of permanent space consumed by the inverted file, the inverted lists are compressed in a two step process. First, the location information associated with each document entry for a given term is run-length encoded, where the first location is stored as an absolute value and all subsequent locations are stored as deltas from the previous location. This yields numbers of significantly smaller magnitude. Then, all numbers (document identifiers, weights, and encoded locations) are represented in base 2 using the minimum number of bytes (up to four), with a continuation bit reserved in each byte. This results in variable length numbers where the largest representable number is 2^28.
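The encoding just described can be sketched as follows. The exact bit conventions (which bit continues, the order of the seven-bit groups) are our assumptions; the paper fixes only the delta step and the four-byte, 2^28 limit.

    #include <stddef.h>
    #include <stdint.h>

    /* Variable-length encoding: seven data bits per byte plus a
     * continuation bit, at most four bytes, so values must be below
     * 2^28. Returns the number of bytes written. */
    size_t vbyte_encode(uint32_t v, uint8_t *out) {
        size_t n = 0;
        do {
            uint8_t b = v & 0x7F;        /* low seven bits of the value */
            v >>= 7;
            if (v != 0)
                b |= 0x80;               /* continuation: more bytes follow */
            out[n++] = b;
        } while (v != 0);
        return n;                        /* 1..4 bytes for v < 2^28 */
    }

    /* Delta-encode a sorted run of locations before vbyte-encoding them:
     * the first is absolute (prev = 0), the rest are gaps. */
    size_t encode_locations(const uint32_t *locs, size_t n, uint8_t *out) {
        size_t len = 0;
        uint32_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            len += vbyte_encode(locs[i] - prev, out + len);
            prev = locs[i];
        }
        return len;
    }

Decoding reverses the process, reassembling each number from its seven-bit groups and accumulating the deltas back into absolute locations.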
Compression yields the same space savings for both the old and new indexing schemes. However, the old scheme allocates objects exactly of the required size, while the new scheme allocates fixed size objects for the variable length lists. Therefore, we would expect a certain amount of internal fragmentation and an increase in inverted file size for the new scheme. The inverted file created in seven batches by the new scheme for TIPSTER volume 1 consumes 473 Mbytes, of which 420 Mbytes is inverted list data. Internal object fragmentation, or free space in objects that could be allocated in the future, accounts for 50 Mbytes, or about 12% of the size of the real data. The remaining 3 Mbytes is Mneme overhead, including auxiliary file structures and links between objects, a relatively insignificant 0.8% overhead. The total overhead is quite tolerable, and other results show that as the index becomes even larger, the overhead decreases slightly due to the full utilization of all but the tail objects in the linked lists for large inverted lists.

The difference in temporary disk space requirements between the two schemes during indexing is much more significant. The transaction files generated by the old scheme for TIPSTER volume 1 consume a total of 1.7 Gbytes, or nearly 50% more disk space than the raw document collection. Each time an update is made, the transaction files for the entire collection must be generated, resulting in temporary disk space costs proportional to the size of the total collection. The new scheme only requires enough space to generate and sort the transaction file for the new documents, yielding temporary disk space requirements proportional to the size of the update. Both schemes could benefit from using suitable compression techniques on the transaction files to reduce temporary disk space requirements, but again the old scheme will still require temporary disk space proportional to the size of the existing database.

4.6 Query Processing

The last performance issue is query processing, which we evaluate by comparing query processing speeds on the inverted files built in Section 4.3. Since the new scheme must assemble large inverted lists with multiple disk reads, we expect its query processing performance to suffer. However, the allocation strategy for large lists, combined with the small number of batch updates used to build the index, works to reduce this effect. Here is why. Large inverted lists allocated in a linked list of large objects are grown by adding new large objects to the tail of the list. New large objects are each created in their own physical segment, and new physical segments are allocated at the end of the file. Therefore, all objects added to a given linked list during a single batch update will be allocated sequentially at the end of the file. Since the index was built in seven batch updates, a large inverted list will consist of at most seven separate blocks of contiguous large objects, greatly reducing the disk seek costs to assemble the list.

To further explore this effect, we measured query processing performance on an index built using the on-line scheme described in Section 4.4 with a main memory inverted list buffer of 10 Mbytes. This produced an inverted file laid out as if it had been built with many small batch updates, where each batch contained approximately 800,000 postings.

The query set we used was generated locally from TIPSTER topics 51-100 using automatic and semi-automatic methods. We report the average of running the query set on each scheme six times, where each run differed from the average by less than 5%. The old scheme, with each list stored as a single contiguous object, required 975 seconds. The new scheme built on-line to simulate many small batches required 1256 seconds, or 28.8% longer than the old scheme. The new scheme built off-line with seven batch updates required 1033 seconds, or only 5.9% longer than the old scheme.

The results for the new index structure are surprisingly good. The improvement from the second to third scheme above indicates that query processing can be greatly improved by simple periodic reorganization of the inverted file to sequentially arrange the objects in linked lists. We expect that query processing can be further improved for the following reasons. First, the query processing model used in these experiments is "term-at-a-time", where the entire inverted list for a term is read and processed all at once. Many modern systems have adopted "document-at-a-time" processing [Wil84], which calculates the complete score for one document before proceeding to the next. In this model, the inverted lists are read in small chunks, a technique ideally suited to the linked list structure. Second, our inverted file structure might be combined with the query optimization techniques proposed by Wong and Lee [WL93] and Moffat and Zobel [MZ94b], who describe methods for eliminating processing on portions of inverted lists. Again, the linked list structure could be used to avoid I/O on these portions of the lists. Third, buffer management has not been tuned for the new file structure. Fourth, we have not considered any low level optimizations, such as overlapping I/O with processing, sophisticated pre-fetching schemes, or disk optimizations such as striping, all of which will reduce the time to assemble large inverted lists.
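For illustration, assembling a chained list looks roughly like this; fetch_object() is a hypothetical stand-in for a Mneme object read, and the layout fields are our own, not the actual object format.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One link in a chained large inverted list (illustrative layout). */
    typedef struct {
        uint32_t next_id;       /* object id of the next link; 0 ends it */
        uint32_t used;          /* bytes of list data present            */
        uint8_t  data[8192];    /* compressed postings                   */
    } ListObject;

    extern ListObject *fetch_object(uint32_t id);   /* hypothetical read */

    /* Term-at-a-time assembly: concatenate the whole chain into 'buf'.
     * A document-at-a-time evaluator would instead consume each link's
     * data as it is fetched, and could stop early. */
    size_t assemble_list(uint32_t head_id, uint8_t *buf, size_t cap) {
        size_t off = 0;
        for (uint32_t id = head_id; id != 0; ) {
            const ListObject *o = fetch_object(id);
            if (off + o->used > cap)
                break;                    /* caller's buffer too small */
            memcpy(buf + off, o->data, o->used);
            off += o->used;
            id = o->next_id;
        }
        return off;
    }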
5 Related Work

Efficient management of full-text database indexes has received a fair amount of attention. Faloutsos [Fal85] gives an early survey of the common indexing techniques. The two techniques that seem to predominate are signature files and inverted files. Since INQUERY uses an inverted file index, we do not discuss signature files.

Zobel et al. [ZMSD92] investigate the efficient implementation of an inverted file index for a full-text database system. Their focus is on compression techniques to limit the size of the inverted file index. These techniques could be usefully incorporated into our system. They also address updates to the inverted file using large fixed length disk blocks, where each block has a heap of inverted lists at the end of the block and a directory into the heap at the beginning of the block. As inverted lists grow they are rearranged in the heap or copied to other blocks with more space. Techniques for handling inverted lists larger than a disk block are not discussed, nor is the disk block technique fully evaluated. Our experience indicates that efficient management of large inverted lists is critical to performance, and we present experimental results demonstrating the effectiveness of our solution.

Tomasic et al. [TGMS94] propose a new data structure to support incremental indexing, and present a detailed simulation study over a variety of disk allocation schemes. The study is extended with a larger synthetic document collection in [STGM94], and a comparison is made with the traditional indexing technique. Their data structure manages small inverted lists in buckets (similar to the disk blocks in [ZMSD92]) and dynamically selects large inverted lists to be managed separately, not unlike our use of different object pools for different sized lists [BCCM94]. Their simulation results indicate that the best long list allocation scheme for update performance is to write the new portion of a long list in a new chunk at the end of the file. This is essentially what we do with our linked lists. However, they predict that query performance with this strategy will be poor. On the contrary, we have shown with an actual implementation that our linked list strategy can in fact provide good query performance, while simultaneously providing superior update performance. Moreover, their simulations assume that all buckets can fit in main memory during indexing, potentially requiring significant main memory resources. Our scheme makes no such assumption, requiring substantially less main memory.

Another scheme that handles large lists distinctly from small lists is proposed by Faloutsos and Jagadish [FJ92a]. In their scheme, small lists are stored as inverted lists, while large lists are stored as signature files. Again, we are primarily concerned with inverted lists and do not consider signature file solutions.

In [FJ92b], Faloutsos and Jagadish examine update and storage costs for a family of long inverted list implementations, where the general case is their "HYBRID" scheme. The HYBRID scheme essentially chains together chunks of the inverted list and provides a number of parameters to control the size of the chunks and the length of the chains.
At one extreme, limiting the length of a chain to one and allowing chunks to grow results in contiguous inverted lists, where relocation of the inverted list into a larger chunk is required when the current chunk is filled. At the other extreme, fixed size chunks and unlimited chain lengths give a standard linked list. Our overall scheme does not fit into this model since we initially grow chunks and then chain fixed size chunks. However, our small lists can be modeled as chains of length one where chunks are doubled in size during relocation, and our large lists can be modeled as unlimited length chains with fixed size chunks. Rather than argue analytically, we have shown experimentally that our scheme provides good update and search costs, with acceptable space overheads.

Cutting and Pedersen [CP90] investigate optimizations for dynamic update of inverted lists managed with a B-tree. For a speed optimization, they propose accumulating postings in a main memory postings buffer, and give both analytical and experimental results. It is difficult to make comparisons with their experimental results due to the size of the collections used. We present results for a collection nearly 100 times larger. We agree that updates should be batched, but our experience with an in-memory scheme indicates that as soon as the batch becomes too large to invert in main memory (i.e., partial inverted lists for the batch must be flushed to the index before the rest of the batch can be processed), an off-line scheme will provide better performance. Again, this requires further research. They also propose storing the smallest inverted lists directly in the B-tree index. An equivalent scheme using our hash term dictionary would be advantageous only if the dictionary did not increase in size. Increasing I/O and main memory costs for the term dictionary for the sake of rarely accessed inverted lists would be disastrous.

A number of other approaches to document indexing have been proposed. Fox and Lee [FL91] describe a technique that eliminates the sorting involved in indexing by making multiple passes over the input documents. Indexing is divided into loads, where a load is a contiguous chunk of the final inverted file. First, an initial pass over the input is made to determine the load boundaries. Then, a subsequent pass is made for each load, processing only terms that fall within the boundaries of the load and building the inverted lists for those terms in main memory. At the end of the pass, the inverted lists for the load can simply be appended to the inverted file.

Witten et al. [WMB94] present a variety of indexing algorithms, including an extended version of Fox and Lee's algorithm. They also enhance the traditional sort-based method with compression techniques and in-place and multiway merging to greatly improve efficiency in terms of main memory, disk space, and time. Results of applying some of these techniques to the TIPSTER document collection are presented in [MZ94a].

All of these other approaches were developed for large static document collections, and do not directly support incremental indexing. Some of the techniques, such as the sorting enhancements and making multiple passes through the input to "pre-allocate" the output, might be usefully incorporated into an incremental system. Their benefit, however, would be dependent on the size of the incremental batches, with larger batches deriving more benefit. A better integration might use one of the above algorithms the first time a large collection is indexed, and switch to an incremental technique thereafter.

Properly modeling the size distribution of inverted lists is addressed by Wolfram in [Wol92a, Wol92b]. He suggests that the informetric characteristics of document databases should be taken into consideration when designing the files used by an IR system. We follow this advice, as can be seen in Section 3, with what we consider to be very satisfactory results.
6 Conclusions

If IR systems are to satisfy the demand for applications that can manage an ever increasing repository of information, they must be able to efficiently add documents to large existing collections. The main bottleneck in that operation is updating the index structure used to manage the collection. The traditional solution to this problem is to re-index the entire collection, an operation with costs proportional to the size of the whole collection. This solution is clearly unacceptable.

We have proposed an alternative solution that yields costs proportional to the size of the update. Using the data management facilities of a persistent object store, we have designed a more sophisticated inverted file index that provides fast incremental updates. More importantly, we have implemented our scheme in an operational full-text information retrieval system and verified its performance empirically. The results we present show that our scheme maintains a nearly constant per posting update cost as the size of the collection grows, indicating excellent potential for scale. In fact, we have used our scheme to index the full 2 Gbyte TIPSTER collection in 13 batches and have found the trends described in Section 4 to hold. Our scheme requires considerably less disk space during indexing than traditional techniques, and allows much of the processing for a new batch of documents to be done independently from the existing index. This last point is particularly important for the eventual support of simultaneous query processing and collection updating, since the period of time during which index structures must be locked for updating can be minimized.

We found that best performance is achieved when documents are added in the largest batches possible, both in terms of incremental indexing time and resultant query processing speed. We have also shown that our scheme provides a good level of performance with small batch updates, and have suggested techniques to improve both small batch update and query processing performance. These techniques bear further investigation and represent future work.

Finally, we have achieved these results using "off-the-shelf" data management technology, continuing to show that the data management facilities in IR systems need not be custom built to achieve high performance.

Acknowledgements

We gratefully acknowledge Eliot Moss and the anonymous referees for their comments and suggestions for improvements.

References

[BCCM94] Eric W. Brown, James P. Callan, W. Bruce Croft, and J. Eliot B. Moss. Supporting full-text information retrieval with a persistent object store. In Proc. of the 4th Inter. Conf. on Extending Database Technology, pages 365-378, March 1994.

[CCH92] James P. Callan, W. Bruce Croft, and Stephen M. Harding. The INQUERY retrieval system. In Proc. of the 3rd Inter. Conf. on Database and Expert Sys. Apps., September 1992.

[CP90] Doug Cutting and Jan Pedersen. Optimizations for dynamic inverted index maintenance. In Proc. of the 13th Inter. ACM SIGIR Conf. on Res. and Develop. in Infor. Retr., pages 405-411, 1990.

[Fal85] Christos Faloutsos. Access methods for text. ACM Computing Surveys, 17(1):49-74, 1985.

[FJ92a] Christos Faloutsos and H. V. Jagadish. Hybrid index organizations for text databases. In Proc. of the 3rd Inter. Conf. on Extending Database Technology, pages 310-327, 1992.

[FJ92b] Christos Faloutsos and H. V. Jagadish. On B-tree indices for skewed distributions. In Proc. of the 18th Inter. Conf. on VLDB, pages 363-374, Vancouver, 1992.
[FL91] Edward A. Fox and Whay C. Lee. FAST-INV: A fast algorithm for building large inverted files. Technical Report TR-91-10, VPI&SU Department of Computer Science, Blacksburg, VA, March 1991.

[Har94] Donna Harman, editor. The Second Text REtrieval Conference (TREC-2). National Institute of Standards and Technology Special Publication, Gaithersburg, MD, 1994.

[Hea78] H. S. Heaps. Information Retrieval: Computational and Theoretical Aspects. Academic Press, Inc., New York, 1978.

[HFBYL92] Donna Harman, Edward Fox, Ricardo Baeza-Yates, and Whay Lee. Inverted files. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, chapter 3, pages 28-43. Prentice Hall, Englewood Cliffs, NJ, 1992.

[Mos90] J. Eliot B. Moss. Design of the Mneme persistent object store. ACM Trans. Inf. Syst., 8(2):103-139, April 1990.

[MZ94a] Alistair Moffat and Justin Zobel. Compression and fast indexing for multi-gigabyte text databases. Australian Comput. J., 26(1):1-9, February 1994.

[MZ94b] Alistair Moffat and Justin Zobel. Self-indexing inverted files for fast text retrieval. Technical Report 94/2, Collaborative Information Technology Research Institute, Department of Computer Science, Royal Melbourne Institute of Technology, Australia, February 1994.

[SM83] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[STGM94] Kurt Shoens, Anthony Tomasic, and Hector Garcia-Molina. Synthetic workload performance analysis of incremental updates. In Proc. of the 17th Inter. ACM SIGIR Conf. on Res. and Develop. in Infor. Retr., Dublin, July 1994.

[TC91] Howard Turtle and W. Bruce Croft. Evaluation of an inference network-based retrieval model. ACM Trans. Inf. Syst., 9(3):187-222, July 1991.

[TC92] Howard R. Turtle and W. Bruce Croft. A comparison of text retrieval models. Comput. J., 35(3):279-290, 1992.

[TGMS94] Anthony Tomasic, Hector Garcia-Molina, and Kurt Shoens. Incremental updates of inverted lists for text document retrieval. In Proc. of the ACM SIGMOD Inter. Conf. on Management of Data, Minneapolis, MN, May 1994.

[Wil84] Peter Willett. A nearest neighbour search algorithm for bibliographic retrieval from multilist files. Inform. Tech., 3(2):78-83, April 1984.

[WL93] Wai Yee Peter Wong and Dik Lun Lee. Implementations of partial document ranking using inverted files. Inf. Process. & Mgmnt., 29(5):647-669, 1993.

[WMB94] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York, 1994.

[Wol92a] Dietmar Wolfram. Applying informetric characteristics of databases to IR system file design, Part I: informetric models. Inf. Process. & Mgmnt., 28(1):121-133, 1992.

[Wol92b] Dietmar Wolfram. Applying informetric characteristics of databases to IR system file design, Part II: simulation comparisons. Inf. Process. & Mgmnt., 28(1):135-151, 1992.

[Zip49] George Kingsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, 1949.

[ZMSD92] Justin Zobel, Alistair Moffat, and Ron Sacks-Davis. An efficient indexing technique for full-text database systems. In Proc. of the 18th Inter. Conf. on VLDB, pages 352-362, Vancouver, 1992.