Table of Contents
- List of Tables
- List of Figures
- Table of Notation
- Preface
- Boolean retrieval
- An example information retrieval problem
- A first take at building an inverted index
- Processing Boolean queries
- The extended Boolean model versus ranked retrieval
- References and further reading
- The term vocabulary and postings lists
- Document delineation and character sequence decoding
- Obtaining the character sequence in a document
- Choosing a document unit
- Determining the vocabulary of terms
- Tokenization
- Dropping common terms: stop words
- Normalization (equivalence classing of terms)
- Stemming and lemmatization
- Faster postings list intersection via skip pointers
- Positional postings and phrase queries
- Biword indexes
- Positional indexes
- Combination schemes
- References and further reading
- Dictionaries and tolerant retrieval
- Search structures for dictionaries
- Wildcard queries
- General wildcard queries
- k-gram indexes for wildcard queries
- Spelling correction
- Implementing spelling correction
- Forms of spelling correction
- Edit distance
- k-gram indexes for spelling correction
- Context sensitive spelling correction
- Phonetic correction
- References and further reading
- Index construction
- Hardware basics
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Dynamic indexing
- Other types of indexes
- References and further reading
- Index compression
- Statistical properties of terms in information retrieval
- Heaps' law: Estimating the number of terms
- Zipf's law: Modeling the distribution of terms
- Dictionary compression
- Dictionary as a string
- Blocked storage
- Postings file compression
- Variable byte codes
- Gamma codes
- References and further reading
- Scoring, term weighting and the vector space model
- Parametric and zone indexes
- Weighted zone scoring
- Learning weights
- The optimal weight g
- Term frequency and weighting
- Inverse document frequency
- Tf-idf weighting
- The vector space model for scoring
- Dot products
- Queries as vectors
- Computing vector scores
- Variant tf-idf functions
- Sublinear tf scaling
- Maximum tf normalization
- Document and query weighting schemes
- Pivoted normalized document length
- References and further reading
- Computing scores in a complete search system
- Efficient scoring and ranking
- Inexact top K document retrieval
- Index elimination
- Champion lists
- Static quality scores and ordering
- Impact ordering
- Cluster pruning
- Components of an information retrieval system
- Tiered indexes
- Query-term proximity
- Designing parsing and scoring functions
- Putting it all together
- Vector space scoring and query operator interaction
- References and further reading
- Evaluation in information retrieval
- Information retrieval system evaluation
- Standard test collections
- Evaluation of unranked retrieval sets
- Evaluation of ranked retrieval results
- Assessing relevance
- Critiques and justifications of the concept of relevance
- A broader perspective: System quality and user utility
- System issues
- User utility
- Refining a deployed system
- Results snippets
- References and further reading
- Relevance feedback and query expansion
- Relevance feedback and pseudo relevance feedback
- The Rocchio algorithm for relevance feedback
- Probabilistic relevance feedback
- When does relevance feedback work?
- Relevance feedback on the web
- Evaluation of relevance feedback strategies
- Pseudo relevance feedback
- Indirect relevance feedback
- Summary
- Global methods for query reformulation
- Vocabulary tools for query reformulation
- Query expansion
- Automatic thesaurus generation
- References and further reading
- XML retrieval
- Basic XML concepts
- Challenges in XML retrieval
- A vector space model for XML retrieval
- Evaluation of XML retrieval
- Text-centric vs. data-centric XML retrieval
- References and further reading
- Exercises
- Probabilistic information retrieval
- Review of basic probability theory
- The Probability Ranking Principle
- The 1/0 loss case
- The PRP with retrieval costs
- The Binary Independence Model
- Deriving a ranking function for query terms
- Probability estimates in theory
- Probability estimates in practice
- Probabilistic approaches to relevance feedback
- An appraisal and some extensions
- An appraisal of probabilistic models
- Tree-structured dependencies between terms
- Okapi BM25: a non-binary model
- Bayesian network approaches to IR
- References and further reading
- Language models for information retrieval
- Language models
- Finite automata and language models
- Types of language models
- Multinomial distributions over words
- The query likelihood model
- Using query likelihood language models in IR
- Estimating the query generation probability
- Ponte and Croft's Experiments
- Language modeling versus other approaches in IR
- Extended language modeling approaches
- References and further reading
- Text classification and Naive Bayes
- The text classification problem
- Naive Bayes text classification
- Relation to multinomial unigram language model
- The Bernoulli model
- Properties of Naive Bayes
- A variant of the multinomial model
- Feature selection
- Mutual information
- Chi2 Feature selection
- Frequency-based feature selection
- Feature selection for multiple classifiers
- Comparison of feature selection methods
- Evaluation of text classification
- References and further reading
- Vector space classification
- Document representations and measures of relatedness in vector spaces
- Rocchio classification
- k nearest neighbor
- Time complexity and optimality of kNN
- Linear versus nonlinear classifiers
- Classification with more than two classes
- The bias-variance tradeoff
- References and further reading
- Exercises
- Support vector machines and machine learning on documents
- Support vector machines: The linearly separable case
- Extensions to the SVM model
- Soft margin classification
- Multiclass SVMs
- Nonlinear SVMs
- Experimental results
- Issues in the classification of text documents
- Choosing what kind of classifier to use
- Improving classifier performance
- Machine learning methods in ad hoc information retrieval
- A simple example of machine-learned scoring
- Result ranking by machine learning
- References and further reading
- Flat clustering
- Clustering in information retrieval
- Problem statement
- Cardinality -- the number of clusters
- Evaluation of clustering
- K-means
- Cluster cardinality in K-means
- Model-based clustering
- References and further reading
- Exercises
- Hierarchical clustering
- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
- Group-average agglomerative clustering
- Centroid clustering
- Optimality of HAC
- Divisive clustering
- Cluster labeling
- Implementation notes
- References and further reading
- Exercises
- Matrix decompositions and latent semantic indexing
- Linear algebra review
- Term-document matrices and singular value decompositions
- Low-rank approximations
- Latent semantic indexing
- References and further reading
- Web search basics
- Background and history
- Web characteristics
- Advertising as the economic model
- The search user experience
- Index size and estimation
- Near-duplicates and shingling
- References and further reading
- Web crawling and indexes
- Overview
- Features a crawler must provide
- Features a crawler should provide
- Crawling
- Crawler architecture
- DNS resolution
- The URL frontier
- Distributing indexes
- Connectivity servers
- References and further reading
- Link analysis
- The Web as a graph
- Anchor text and the web graph
- PageRank
- Markov chains
- The PageRank computation
- Topic-specific PageRank
- Hubs and Authorities
- Choosing the subset of the Web
- References and further reading
- Bibliography
- Author Index