Indexing and Searching

Introduction
How to retrieval information?
A simple alternative is to search the whole
text sequentially
Another option is to build data structures
over the text (called indices) to speed up
the search

Introduction
Indexing techniques:
Inverted files
Suffix arrays
Signature files

Notation
n: the size of the text
m: the length of the pattern
v: the size of the vocabulary
M: the amount of main memory available

Inverted Files
an inverted file is a word-oriented
mechanism for indexing a text collection
an inverted file is composed of vocabulary
and the occurrences
occurrences - set of lists of the text
positions where the word appears

Searching
The search algorithm on an inverted
index follows three steps:
Vocabulary search: the words present in

the query are searched in the vocabulary
Retrieval occurrences: the lists of the
occurrences of all words found are retrieved
Manipulation of occurrences: the
occurrences are processed to solve the
query

Searching
 Searching task on an inverted file always starts in
the vocabulary ( It is better to store the vocabulary in
a separate file )
 The structures most used to store the vocabulary
are hashing, tries or B-trees
 An alternative is simply storing the words in
lexicographical order ( cheaper in space and very
competitive with O(log v) cost )

Construction
building and maintaining an inverted index
is a low cost task
an inverted index on a text of n characters
can be built in O(n) time
all the vocabulary is kept in a trie data
structure
for each word, a list of its occurrences
(test positions) is stored

Construction
each word of is read and searched in the
vocabulary

Other Indices for Text
Suffix trees and suffix arrays
Signature files

Suffix Trees and Suffix Arrays
Inverted indices assumes that the text can
be seen as a sequence of words
Restricts the kinds of queries that can be
answered
Other queries, ex. phrases, are expensive to
solve

Problem: the concept of word does not exist
in some applications, e.g. genetic
database

Suffix Trees
A suffix tree is
a trie data structure
built over all the suffixes of the text
(pointers to the suffixes are stored at the leaf nodes)

compacted into a Patricia tree, to improve
space utilization
The tree has O(n) nodes instead of O(n2) nodes of the
trie

Suffix Arrays
a space efficient implementation of suffix
trees
allow user to answer more complex
queries efficiently
main drawbacks
costly construction process
text must be available at query time
results are not delivered in text position order

Suffix Arrays
Can be used to index only words as the
inverted index (unless complex queries are
important, inverted files performs better)

a text suffix is a position in the text (each
suffix is uniquely identified by its position)

index points, which point to the beginning of the text
positions, are selected from the text
elements which are not indexed are not
retrievable

Suffix Arrays: Structure
Suffix tree problem: space requirement
Each node of the trie takes 12 to 24 bytes
-> space overhead of 120% to 240% over the text
size is produced

Suffix array is an array containing all the
pointers to the text suffixes listed in
lexicographical order
requires much less space

Suffix Arrays: Structure
are designed to allow binary searches (by
comparing the contents of each pointer)

for large suffix arrays, binary search can
perform poorly (due to the number of random
disk accesses)

one solution is to use supra-indices over
suffix array

Suffix Arrays: Searching
basic pattern (ex. words, prefixes, and
phrases) search on a suffix tree can be
done in O(m) time by a simple trie search
binary search on suffix arrays can be done
in O(log n) time

Suffix Arrays: Construction
a suffix tree for a text of n characters can
be built in O(n) time
if the suffix tree does not fit in main
memory, the algorithm performs poorly

Suffix Arrays: Construction
an algorithm to build the suffix array in O(n
log n) character comparisons
suffixes are bucket-sorted in O(n) time
according to the first letter only
each bucket is bucket-sorted again according to
the first two letters

Suffix Arrays: Construction
at iteration i the suffixes begin already sorted by
their 2i-1 first letters and end up sorted by their
first 2i letters
each iteration, the total cost of all the bucket
sorts is O(n), the total time is O(n log n), and
the average is O(n log log n) (O(log n)
comparisons are necessary on average to
distinguish two suffixes of a text)

Suffix Arrays: Construction
 large text will not fit in main memory
 split large text into blocks that can be sorted in
main memory
 build suffix array in main memory and merge it
the the rest of the array already built for the
previous text
 compute counters to store information of how
many suffixes of the large text lie between each
pair of positions of the suffix array

Signature Files
Signature files are word-oriented index
structures based on hashing
low overhead (10%-20% over the text size)
forcing a sequential search over the index
suitable for not very large texts
Nevertheless, inverted files outperform signature
files for most applications

Signature Files: Structure
A signature file uses a hash function (or
‘signature’) that maps words to bit masks
of B bits
the text is devided in blocks of b words
each
to each text block of size b, a bit mask of
size B is assigned
this mask is obtained by bitwise ORing the
signatures of all the words in the text block

Signature Files: Structure
 a signature file is the sequence of bit masks of
all blocks (plus a pointer to each block)
 the main idea is:
if a word is present in a text block, then all the bits set in
its signature are also set in the bit mask of the text block
false drop: all the corresponding bits are set even
though the word is not in the block

 to ensure low probability of a false drop and to
keep the signature file as short as possible

Signature Files: Searching
Searching a single word is done by
1. hashing that word to a bit mask W
2. comparing the bit masks Bi of all the text
blocks
3. whenever (W & Bi = W), the text block may
contain the word
4. online traversal must be performed on all the
candidate text blocks to verify if the word is
there

Signature Files: Searching
This scheme is efficient to search phrases
and reasonable proximity queries

Signature Files: Construction
Constructing a signature file:
1. cut text into blocks
2. generate an entry of the signature file for each
block
 this entry is the bitwise OR of the signatures of all the
words in the block

Adding text: adding records to the signature file
Deleting text: deleting the appropriate bit masks

Signature Files: Construction
Storage proposals:
1. store all the bit masks in sequence
2. make a different file for each bit of the
mask
(ex. one file holding all the first bits, another file for all
the second bits)

result: the reduction in the disk times to search
for a query
(only the files corresponding to the l bits, which are set
in the query, have to be traversed)

Boolean Queries
 the set manipulation algorithms are used
when operating on sets of results
 the search proceeds in three phases:
1. determines which documents classify
2. determines the relevance of the classifying
documents (to present them to the user)
3. retrieves the exact positions of the matches in
those documents (that the user wants to see)

Sequential Searching
when no data structure has been built on the text
1. brute-force (bf) algorithm
2. Knuth-Morris-Pratt
3. Boyer-Moore Family
4. Shift-Or
5. Suffix Automaton
6. Practical Comparison
7. Phrases and Proximity

Brute Force (BF)
Trying all possible pattern positions in the text
 a window of length m is slid over the text
 if the text in the window is equal to the pattern,
report as a match
 shift the window forward

 worst-case is O(mn)
 average case is O(n)
 does not need any pattern preprocessing

Knuth-Morris-Pratt (KMP)
linear worst-case behavior O(2n)
average case is not much faster than BF
slide a window over the text
does not try all window positions
reuses information from previous checks

the pattern is preprocessed in O(m) time

Boyer-Moore Family
the check inside the window can proceed
backwards
for every pattern position j the next-to-last
occurrence of Pj..m inside P called match
heuristic
occurrence heuristic: the text character
that produced the mismatch has to be
aligned with the same character in the
pattern after the shift

Boyer-Moore Family
proprocessing time and space of this
algorithm: O(m+σ)
search time is O(n log (m)/m) on average
worst case is O(mn)

Shift-Or
 uses bit-parallelism to simulate the operation of
a non-deterministic automaton that searches the
pattern in the text
 builds a table B which stores a bit mask bm...b1
for each character
 update D using the formula
D’ ← (D<<1) | B[Tj]
 O(n) on average, preprocessing is O(m+σ)
 O(σ) space

Suffix Automaton
The Backward DAWG matching (BDM) is
based on a suffix automaton
a suffix automaton on a pattern P is an
automaton that recognizes all the suffixes
of P
the BDM algorithm converts a nondeterministic suffix automaton to
deterministic

Suffix Automaton
the size and construction time is O(m)
build a suffix automaton Pr (the reversed
pattern) to search for pattern P
a match is found if the complete window is
read
O(mn) time for worst case
O(n log(m)/m) on average

Practical Comparison
Figure 8.20 compares string matching
algorithms on:
English text from the TREC collection
DNA
random text uniformly generated over 64 letters
patterns were randomly selected from the text,
except for random text, patterns are randomly
generated

Phrases and Proximity
if the sequence of words is searched as
appear in text, the problem is similar to
search of a single pattern
to search a phrase element-wise is to
search for the element which is less
frequent or can be searched faster, then
check the neighboring words
same for proximity query search

Pattern Matching
Two main groups of techniques to deal with
complex patterns:
1. searching allowing errors
2. searching for extended patterns

Searching Allowing Errors
Approximate string matching problem: given
a short pattern P of length m, a long text T of
length n, and a maximum allowed number of
errors k, find all the text positions where the
pattern occurs with at most k errors

The main approaches for solution: dynamic
programming, automaton, bit-parallelism and
filtering

Dynamic Programming
 the classical solution to approximate string
matching
 fill a matrix C column by column where C[i,j]
represents the minimum numbers of errors
needed to match P1..i to a suffix of T1..j
C[0,j] = 0
C[i,0] = i
C[i,j] = if (Pi=Tj) then C[i-1,j-1]
else 1+min(C[i-1,j],C[i,j-1],C[i-1,j-1]

Automaton
reduce the problem to a non-deterministic
finite automaton (NFA)
each row of the NFA denotes the number
of errors seen
every column represents matching the
pattern up to a giver position

Automaton
 horizontal arrows represent matching a
character
 vertical arrows represent insertions into the
pattern
 solid diagonal arrows represent replacements
 dashed diagonal arrows represent deletions in
the pattern
 the automaton accepts a text position as the end
of a match with k errors whenever the (k+1)-th
rightmost state is active

Bit-Parallelism
use bit-parallelism to parallelize the
computation of the dynamic programming
matrix

Filtering
reduce the area of the text where dynamic
programming needs to be used by filter
the text first

Regular Expressions and Extended
Patterns
 build a non-deterministic finite automaton of size
O(m)
m: the length of the regular expression

 convert this automaton to deterministic form
search any regular expression in O(n) time
its size and construction time can be exponential in m,
i.e. O(m2m)

 bit-parallelism has been proposed to avoid the
construction of the deterministic automaton

Pattern Matching Using Indices
how to extend the indexing techniques for
simple searching for more complex
patterns
inverted files:
using a sequential search over the vocabulary
and merge their lists of occurrences to retrieve
a list of documents and the matching text
positions

Pattern Matching Using Indices
suffix tree and suffix array:
if all text positions are indexed, words, prefixes,
suffixes and substrings can be searched with
the same search algorithm and cost -> 10 to 20
times text size for index
range queries can be done by searching both
extremes in the trie and collects all leaves in the
middle

Pattern Matching Using Indices
regular expression search and unrestricted
approximate string matching can be done by
sequential searching
other complex searches that can be done are:
find the longest substring in the text that
appears more than once, find the most common
substring of a fixed size
suffix array implementation reduce operation
cost from C(n) (on suffix tree) to O(C(n) log n)
cost

Structural Queries
The algorithms to search on structured text
are dependent on each model
Some considerations are:
how to store the structural information
build and ad hoc index to store the structure
(efficient and independent of the text)
mark the structure in the text using ‘tags’
(efficient in many cases, integration into an
existing text database is simpler)

Compression
Sequential searching
Compressed indices
Inverted files
Suffix trees and suffix arrays
Signature files

Sequential Searching
Huffman coding technique allows directly
searching compressed text
Boyer-Moore filtering can be used to
speed up the search in Huffman coding
trees
Phrase searching can be accomplished
using the i-th bit mask with the i-th element
of the phrase query + Shift-Or algorithm
(simple and efficient)

Compressed Indices:
Inverted Files
the lists of occurrences are in increasing
order of text position
the differences between the previous
position and the current one can be
represented using less space by using
techniques that favor small numbers
the text can also be compressed
independently of the index

Compressed Indices:
Suffix Trees and Suffix Arrays
 reduction of space requirements -> more
expensive searching
 reduced space requirements are similar to
those of uncompressed suffix arrays at much
smaller performance penalties
 suffix arrays are hard to compress because
they represent an almost perfectly random
permutation of the pointers to the text

Compressed Indices:
Suffix Trees and Suffix Arrays
 building suffix arrays on compressed text:
reduce space requirements
index construction and querying almost double
performance
construction is faster because more compressed text
fits in the same memory space -> fewer text blocks
are needed

 Hu-Tucker coding allows direct binary search
over the compressed text

Compressed Indices:
Signature Files
 compression techniques are based on the fact
that only a few bits are set in the whole file
 possible to use efficient methods to code the
bits which are not set (ex. run-length encoding)
 different considerations arise if the file is stored
as a sequence of bit masks or with one file per
bit of the mask
 advantages are
reduce space and disk times
increase B (reduce the false drop probability)

Trends and Research Issues
The main trends in indexing and searching
textual database are
 text collections are becoming huge
 searching is becoming more complex
 compression is becoming a star in the
field