CS 276A / LING 239I |
MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
MIR = Modern Information Retrieval, by R. Baeza-Yates and
B. Ribeiro-Neto.
FOA = Finding Out About, by R. Belew.
FSNLP = Foundations of Statistical Natural Language Processing,
by C. Manning and H. Schütze.
Information Retrieval: Algorithms and Heuristics by D. Grossman and
O. Frieder
These books all have useful information on topics that we cover and are recommended as references. None of them are required textbooks or cover all of the material in this course. MG is particularly good for technical IR in the first half of the course.
CM = Christopher Manning
PR = Prabhakar Raghavan
LE = Louis Eisenberg
DG = Daniel Gindikin
All lectures, exams and review sessions are in Gates B01.
Lectures are Tuesdays and Thursdays from 4:15 to 5:30.
Review sessions, when scheduled, are Friday from 3:15 to 4:05.
Details of the schedule, slides and reading lists will be updated as
the quarter progresses. Assignment dates are subject to change.
We will schedule review sessions for assignments, probably on Friday afternoons.
Date | Topics | Notes | Who | Readings | Assignments |
---|---|---|---|---|---|
Tue 28 Sep | IR 1. Introduction to Information Retrieval. Inverted indices and boolean queries. Query optimization. The nature of unstructured and semi-structured text. Course administrivia. |
[powerpoint] [PDF] |
CM |
MG Sec. 3.0-3.2; MIR Sec. 8.1-8.2 Shakespeare plays |
|
Thu 30 Sep | IR 2. Text encoding: tokenization, stemming, lemmatization, stop words, phrases. Further optimizing indices for query processing. Proximity and phrase queries. Positional indices. |
[powerpoint] [PDF] |
PR | MG Ch. 3.7, 4.2-4.3; MIR Ch. 7.2 (8.5.7) Porter's stemmer Porter from the author Lovins stemmer Fast Phrase Querying with Combined Indexes |
|
Tue 5 Oct | IR 3. Index compression: lexicon compression and postings lists compression. Gap encoding, gamma codes, Zipf's Law. Blocking. Extreme compression. |
[powerpoint] [PDF] |
PR | MG 3.3, 3.4 Compression of inverted indexes for fast query evaluation Explanation of Elias gamma codes (note that in our class unary codes are flipped from what's on this page, i.e. ones followed by a zero instead of zeroes followed by a one) |
PS1 out [doc] [PDF] |
Thu 7 Oct | IR 4. Query expansion: spelling correction and synonyms. Wild-card queries, permuterm indices, n-gram indices. Edit distance, soundex, language detection. |
[powerpoint] [PDF] |
PR | MG Ch. 5; J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software - practice and experience 25(3), March 1995. |
|
Fri 8 Oct | Review session for PS1 |
[doc] [PDF] |
LE | ||
Tue 12 Oct | IR 5. Index construction. Postings size estimation, merge sort, dynamic indexing, positional indexes, n-gram indexes, real-world issues. |
[powerpoint] [PDF] |
PR |
MG Ch. 4.4-4.6; MIR 2.5, 2.7.2 |
PE1 out [doc] [PDF] |
Thu 14 Oct | IR 6. Parametric or fielded search. Document zones. The vector space retrieval model. tf.idf weighting. Scoring documents. |
[powerpoint] [PDF] |
PR | PS1 due | |
Fri 15 Oct | Review session for PE1 |
[doc] [PDF] |
LE | ||
Tue 19 Oct | IR 7. Vector space scoring. The cosine measure. Efficiency considerations. Nearest neighbor techniques, reduced dimensionality approximations, random projection. |
[powerpoint] [PDF] |
PR |
MG 4.5, MIR Ch. 3 |
|
Thu 21 Oct | IR 8. Results summaries: static and dynamic. Evaluating search engines. User happiness, precision, recall, F-measure. Creating test collections: kappa measure, interjudge agreement. Relevance, approximate vector retrieval. |
[powerpoint] [PDF] |
CM |
Automatic
Word Sense Discrimination |
PE1 due |
Fri 22 Oct | Review session for midterm |
[doc] [PDF] |
LE | ||
Tue 26 Oct | Midterm to be held in-class |
[doc] [PDF] |
|||
Thu 28 Oct | IR 9. Relevance feedback. Pseudo relevance feedback. Query expansion. Automatic thesaurus generation. Sense-based retrieval. Experimental results of performance effectiveness. |
[powerpoint] [PDF] |
CM |
MG Ch 4.7; MIR Ch 5.2-5.4 Learning Routing Queries in a Query Zone Concept Based Query Expansion New Retrieval Approaches Using SMART: TREC 4 Gerard Salton and Chris Buckley. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990. Jinxi Xu and W. Bruce Croft: Query Expansion Using Local and Global Document Analysis. SIGIR 1996: 4-11 |
|
Tue 2 Nov | PROBABILITIES 1. Probabilistic models for text problems. Classical probabilistic IR. Language models. |
[powerpoint] [PDF] |
CM |
MIR 2.5.4, 2.8 C. J. van Rijsbergen, Information Retrieval, ch. 6 Probabilistic Models in Information Retrieval (Fuhr 92) E. Charniak, Bayesian Networks without Tears H. Turtle's dissertation |
|
Thu 4 Nov | PROBABILITIES 2. Introduction to text classification. Naive Bayes models. Spam filtering. |
[powerpoint] [PDF] |
CM |
Machine
Learning in Automated Text Categorization A Comparison of Event Models for Naive Bayes Text Classification Tom Mitchell. Machine Learning. McGraw-Hill, 1997. A Re-examination of Text Categorization Methods A Plan for Spam (Paul Graham). Better Bayesian Filtering (Paul Graham). 2003 Spam Conference proceedings |
|
Tue 9 Nov | PROBABILITIES 3. Probabilistic language models for IR. Bayesian nets for IR. |
[powerpoint] [PDF] |
CM |
J.M. Ponte and
W.B. Croft. 1998 A. Berger and J. Lafferty. 1999 Workshop on Language Modeling and Information Retrieval, CMU, 2001 The Lemur Toolkit for Language Modeling and Information Retrieval |
PS2 out [doc] [PDF] |
Thu 11 Nov | CLUSTERING 1. Introduction to the problem. Partitioning methods. k-means clustering. Mixture of gaussians model. Clustering versus classification. |
[powerpoint] [PDF] |
PR |
Scatter/Gather Data Clustering Review Initialization of iterative refinement clusting algorithms Scaling Clustering Algorithms to Large Databases |
|
Fri 12 Nov | Review session for PS2 |
[doc] [PDF] |
DG | ||
Tue 16 Nov | CLUSTERING 2. Hierarchical agglomerative clustering. Clustering terms using documents. Labelling clusters. Evaluating clustering. Text-specific issues. |
[powerpoint] [PDF] |
PR | FSNLP Ch. 13 Single-Link and Complete-Link Clustering | PE2 out [doc] [PDF] |
Thu 18 Nov | CLUSTERING 3. Reduced dimensionality/spectral methods. Latent semantic indexing (LSI). Applications to clustering and to information retrieval. |
[powerpoint] [PDF] |
PR |
FSNLP 15.4 Projections for Efficient Document Clustering Spectral Analysis of Data Latent semantic indexing Random projection theorem Faster random projection |
PS2 due |
Fri 19 Nov | Review session for PE2 |
[doc] [PDF] |
LE, DG | ||
Tue 23 Nov | CLASSIFICATION 1. Vector space classification using hyperplanes; centroids; k Nearest Neighbors. |
[powerpoint] [PDF] |
CM | FSNLP Ch. 16 A Comparative Study on Feature Selection in Text Categorization Evaluating and Optimizing Autonomous Text Classification Systems Dumais, Platt, Heckerman, and Sahami. 1998. Inductive learning algorithms and representations for text categorization. CIKM 1998. Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001. Reuters dataset |
|
Thu 25 Nov | Thanksgiving (no class) | ||||
Tue 30 Nov | CLASSIFICATION 2. Support vector machine classifiers. Kernel functions. |
[powerpoint] [PDF] |
CM | FSNLP Ch. 16 A Tutorial on Support Vector Machines for Pattern Recognition Conceptual Information Retrieval Using RUBRIC Dumais. Using SVMs for Text Categorization. IEEE Intelligent Systems 13(4) Jul-Aug 1998. Text Categorization Based on Regularized Linear Classification Methods | PE2 due |
Thu 2 Dec | CLASSIFICATION 3. Text classification. Exploiting text-specific features. Feature selection. Evaluation of classification. Micro- and macro-averaging. Comparative results. |
[powerpoint] [PDF] |
CM |
Machine
Learning in Automated Text Categorization A Re-examination of Text Categorization Methods A Comparative Study on Feature Selection in Text Categorization |
|
Fri 3 Dec | Review session for final |
[doc] [PDF] |
LE, DG | ||
Thu 9 Dec | Final Exam 12:15pm-3:15pm |