OnDoc

OnDoc | sign in up ~ pub | pdf txt html toc url | << >> 1 ~ 574 575 576 577 578 579 580 581

List of Tables
List of Figures
Table of Notation
Preface
Boolean retrieval
- An example information retrieval problem
- A first take at building an inverted index
- Processing Boolean queries
- The extended Boolean model versus ranked retrieval
- References and further reading
The term vocabulary and postings lists
- Document delineation and character sequence decoding
  - Obtaining the character sequence in a document
  - Choosing a document unit
- Determining the vocabulary of terms
  - Tokenization
  - Dropping common terms: stop words
  - Normalization (equivalence classing of terms)
  - Stemming and lemmatization
- Faster postings list intersection via skip pointers
- Positional postings and phrase queries
  - Biword indexes
  - Positional indexes
  - Combination schemes
- References and further reading
Dictionaries and tolerant retrieval
- Search structures for dictionaries
- Wildcard queries
  - General wildcard queries
  - k-gram indexes for wildcard queries
- Spelling correction
  - Implementing spelling correction
  - Forms of spelling correction
  - Edit distance
  - k-gram indexes for spelling correction
  - Context sensitive spelling correction
- Phonetic correction
- References and further reading
Index construction
- Hardware basics
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Dynamic indexing
- Other types of indexes
- References and further reading
Index compression
- Statistical properties of terms in information retrieval
  - Heaps' law: Estimating the number of terms
  - Zipf's law: Modeling the distribution of terms
- Dictionary compression
  - Dictionary as a string
  - Blocked storage
- Postings file compression
  - Variable byte codes
  - Gamma codes
- References and further reading
Scoring, term weighting and the vector space model
- Parametric and zone indexes
  - Weighted zone scoring
  - Learning weights
  - The optimal weight g
- Term frequency and weighting
  - Inverse document frequency
  - Tf-idf weighting
- The vector space model for scoring
  - Dot products
  - Queries as vectors
  - Computing vector scores
- Variant tf-idf functions
  - Sublinear tf scaling
  - Maximum tf normalization
  - Document and query weighting schemes
  - Pivoted normalized document length
- References and further reading
Computing scores in a complete search system
- Efficient scoring and ranking
  - Inexact top K document retrieval
  - Index elimination
  - Champion lists
  - Static quality scores and ordering
  - Impact ordering
  - Cluster pruning
- Components of an information retrieval system
  - Tiered indexes
  - Query-term proximity
  - Designing parsing and scoring functions
  - Putting it all together
- Vector space scoring and query operator interaction
- References and further reading
Evaluation in information retrieval
- Information retrieval system evaluation
- Standard test collections
- Evaluation of unranked retrieval sets
- Evaluation of ranked retrieval results
- Assessing relevance
  - Critiques and justifications of the concept of relevance
- A broader perspective: System quality and user utility
  - System issues
  - User utility
  - Refining a deployed system
- Results snippets
- References and further reading
Relevance feedback and query expansion
- Relevance feedback and pseudo relevance feedback
  - The Rocchio algorithm for relevance feedback
  - Probabilistic relevance feedback
  - When does relevance feedback work?
  - Relevance feedback on the web
  - Evaluation of relevance feedback strategies
  - Pseudo relevance feedback
  - Indirect relevance feedback
  - Summary
- Global methods for query reformulation
  - Vocabulary tools for query reformulation
  - Query expansion
  - Automatic thesaurus generation
- References and further reading
XML retrieval
- Basic XML concepts
- Challenges in XML retrieval
- A vector space model for XML retrieval
- Evaluation of XML retrieval
- Text-centric vs. data-centric XML retrieval
- References and further reading
- Exercises
Probabilistic information retrieval
- Review of basic probability theory
- The Probability Ranking Principle
  - The 1/0 loss case
  - The PRP with retrieval costs
- The Binary Independence Model
  - Deriving a ranking function for query terms
  - Probability estimates in theory
  - Probability estimates in practice
  - Probabilistic approaches to relevance feedback
- An appraisal and some extensions
  - An appraisal of probabilistic models
  - Tree-structured dependencies between terms
  - Okapi BM25: a non-binary model
  - Bayesian network approaches to IR
- References and further reading
Language models for information retrieval
- Language models
  - Finite automata and language models
  - Types of language models
  - Multinomial distributions over words
- The query likelihood model
  - Using query likelihood language models in IR
  - Estimating the query generation probability
  - Ponte and Croft's Experiments
- Language modeling versus other approaches in IR
- Extended language modeling approaches
- References and further reading
Text classification and Naive Bayes
- The text classification problem
- Naive Bayes text classification
  - Relation to multinomial unigram language model
- The Bernoulli model
- Properties of Naive Bayes
  - A variant of the multinomial model
- Feature selection
  - Mutual information
  - Chi2 Feature selection
  - Frequency-based feature selection
  - Feature selection for multiple classifiers
  - Comparison of feature selection methods
- Evaluation of text classification
- References and further reading
Vector space classification
- Document representations and measures of relatedness in vector spaces
- Rocchio classification
- k nearest neighbor
  - Time complexity and optimality of kNN
- Linear versus nonlinear classifiers
- Classification with more than two classes
- The bias-variance tradeoff
- References and further reading
- Exercises
Support vector machines and machine learning on documents
- Support vector machines: The linearly separable case
- Extensions to the SVM model
  - Soft margin classification
  - Multiclass SVMs
  - Nonlinear SVMs
  - Experimental results
- Issues in the classification of text documents
  - Choosing what kind of classifier to use
  - Improving classifier performance
- Machine learning methods in ad hoc information retrieval
  - A simple example of machine-learned scoring
  - Result ranking by machine learning
- References and further reading
Flat clustering
- Clustering in information retrieval
- Problem statement
  - Cardinality -- the number of clusters
- Evaluation of clustering
- K-means
  - Cluster cardinality in K-means
- Model-based clustering
- References and further reading
- Exercises
Hierarchical clustering
- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
  - Time complexity of HAC
- Group-average agglomerative clustering
- Centroid clustering
- Optimality of HAC
- Divisive clustering
- Cluster labeling
- Implementation notes
- References and further reading
- Exercises
Matrix decompositions and latent semantic indexing
- Linear algebra review
  - Matrix decompositions
- Term-document matrices and singular value decompositions
- Low-rank approximations
- Latent semantic indexing
- References and further reading
Web search basics
- Background and history
- Web characteristics
  - The web graph
  - Spam
- Advertising as the economic model
- The search user experience
  - User query needs
- Index size and estimation
- Near-duplicates and shingling
- References and further reading
Web crawling and indexes
- Overview
  - Features a crawler must provide
  - Features a crawler should provide
- Crawling
  - Crawler architecture
  - DNS resolution
  - The URL frontier
- Distributing indexes
- Connectivity servers
- References and further reading
Link analysis
- The Web as a graph
  - Anchor text and the web graph
- PageRank
  - Markov chains
  - The PageRank computation
  - Topic-specific PageRank
- Hubs and Authorities
  - Choosing the subset of the Web
- References and further reading
Bibliography
Author Index

Table of Contents