CS 276A / LING 239I
Text Retrieval and Mining
Autumn 2004

Course Syllabus

Books:

MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
MIR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FOA = Finding Out About, by R. Belew.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.
Information Retrieval: Algorithms and Heuristics by D. Grossman and O. Frieder

These books all have useful information on topics that we cover and are recommended as references. None of them are required textbooks or cover all of the material in this course. MG is particularly good for technical IR in the first half of the course.

Lecturers:

CM = Christopher Manning
PR = Prabhakar Raghavan
LE = Louis Eisenberg
DG = Daniel Gindikin

Locations and times:

All lectures, exams and review sessions are in Gates B01.
Lectures are Tuesdays and Thursdays from 4:15 to 5:30.
Review sessions, when scheduled, are Friday from 3:15 to 4:05.

Tentative Schedule:

Details of the schedule, slides and reading lists will be updated as the quarter progresses. Assignment dates are subject to change.
We will schedule review sessions for assignments, probably on Friday afternoons.

Date	Topics	Notes	Who	Readings	Assignments
Tue 28 Sep	IR 1. Introduction to Information Retrieval. Inverted indices and boolean queries. Query optimization. The nature of unstructured and semi-structured text. Course administrivia.	[powerpoint] [PDF]	CM	MG Sec. 3.0-3.2; MIR Sec. 8.1-8.2 Shakespeare plays
Thu 30 Sep	IR 2. Text encoding: tokenization, stemming, lemmatization, stop words, phrases. Further optimizing indices for query processing. Proximity and phrase queries. Positional indices.	[powerpoint] [PDF]	PR	MG Ch. 3.7, 4.2-4.3; MIR Ch. 7.2 (8.5.7) Porter's stemmer Porter from the author Lovins stemmer Fast Phrase Querying with Combined Indexes
Tue 5 Oct	IR 3. Index compression: lexicon compression and postings lists compression. Gap encoding, gamma codes, Zipf's Law. Blocking. Extreme compression.	[powerpoint] [PDF]	PR	MG 3.3, 3.4 Compression of inverted indexes for fast query evaluation Explanation of Elias gamma codes (note that in our class unary codes are flipped from what's on this page, i.e. ones followed by a zero instead of zeroes followed by a one)	PS1 out [doc] [PDF]
Thu 7 Oct	IR 4. Query expansion: spelling correction and synonyms. Wild-card queries, permuterm indices, n-gram indices. Edit distance, soundex, language detection.	[powerpoint] [PDF]	PR	MG Ch. 5; J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software - practice and experience 25(3), March 1995.
Fri 8 Oct	Review session for PS1	[doc] [PDF]	LE
Tue 12 Oct	IR 5. Index construction. Postings size estimation, merge sort, dynamic indexing, positional indexes, n-gram indexes, real-world issues.	[powerpoint] [PDF]	PR	MG Ch. 4.4-4.6; MIR 2.5, 2.7.2	PE1 out [doc] [PDF]
Thu 14 Oct	IR 6. Parametric or fielded search. Document zones. The vector space retrieval model. tf.idf weighting. Scoring documents.	[powerpoint] [PDF]	PR		PS1 due
Fri 15 Oct	Review session for PE1	[doc] [PDF]	LE
Tue 19 Oct	IR 7. Vector space scoring. The cosine measure. Efficiency considerations. Nearest neighbor techniques, reduced dimensionality approximations, random projection.	[powerpoint] [PDF]	PR	MG 4.5, MIR Ch. 3
Thu 21 Oct	IR 8. Results summaries: static and dynamic. Evaluating search engines. User happiness, precision, recall, F-measure. Creating test collections: kappa measure, interjudge agreement. Relevance, approximate vector retrieval.	[powerpoint] [PDF]	CM	Automatic Word Sense Discrimination	PE1 due
Fri 22 Oct	Review session for midterm	[doc] [PDF]	LE
Tue 26 Oct	Midterm to be held in-class	[doc] [PDF]
Thu 28 Oct	IR 9. Relevance feedback. Pseudo relevance feedback. Query expansion. Automatic thesaurus generation. Sense-based retrieval. Experimental results of performance effectiveness.	[powerpoint] [PDF]	CM	MG Ch 4.7; MIR Ch 5.2-5.4 Learning Routing Queries in a Query Zone Concept Based Query Expansion New Retrieval Approaches Using SMART: TREC 4 Gerard Salton and Chris Buckley. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990. Jinxi Xu and W. Bruce Croft: Query Expansion Using Local and Global Document Analysis. SIGIR 1996: 4-11
Tue 2 Nov	PROBABILITIES 1. Probabilistic models for text problems. Classical probabilistic IR. Language models.	[powerpoint] [PDF]	CM	MIR 2.5.4, 2.8 C. J. van Rijsbergen, Information Retrieval, ch. 6 Probabilistic Models in Information Retrieval (Fuhr 92) E. Charniak, Bayesian Networks without Tears H. Turtle's dissertation
Thu 4 Nov	PROBABILITIES 2. Introduction to text classification. Naive Bayes models. Spam filtering.	[powerpoint] [PDF]	CM	Machine Learning in Automated Text Categorization A Comparison of Event Models for Naive Bayes Text Classification Tom Mitchell. Machine Learning. McGraw-Hill, 1997. A Re-examination of Text Categorization Methods A Plan for Spam (Paul Graham). Better Bayesian Filtering (Paul Graham). 2003 Spam Conference proceedings
Tue 9 Nov	PROBABILITIES 3. Probabilistic language models for IR. Bayesian nets for IR.	[powerpoint] [PDF]	CM	J.M. Ponte and W.B. Croft. 1998 A. Berger and J. Lafferty. 1999 Workshop on Language Modeling and Information Retrieval, CMU, 2001 The Lemur Toolkit for Language Modeling and Information Retrieval	PS2 out [doc] [PDF]
Thu 11 Nov	CLUSTERING 1. Introduction to the problem. Partitioning methods. k-means clustering. Mixture of gaussians model. Clustering versus classification.	[powerpoint] [PDF]	PR	Scatter/Gather Data Clustering Review Initialization of iterative refinement clusting algorithms Scaling Clustering Algorithms to Large Databases
Fri 12 Nov	Review session for PS2	[doc] [PDF]	DG
Tue 16 Nov	CLUSTERING 2. Hierarchical agglomerative clustering. Clustering terms using documents. Labelling clusters. Evaluating clustering. Text-specific issues.	[powerpoint] [PDF]	PR	FSNLP Ch. 13 Single-Link and Complete-Link Clustering	PE2 out [doc] [PDF]
Thu 18 Nov	CLUSTERING 3. Reduced dimensionality/spectral methods. Latent semantic indexing (LSI). Applications to clustering and to information retrieval.	[powerpoint] [PDF]	PR	FSNLP 15.4 Projections for Efficient Document Clustering Spectral Analysis of Data Latent semantic indexing Random projection theorem Faster random projection	PS2 due
Fri 19 Nov	Review session for PE2	[doc] [PDF]	LE, DG
Tue 23 Nov	CLASSIFICATION 1. Vector space classification using hyperplanes; centroids; k Nearest Neighbors.	[powerpoint] [PDF]	CM	FSNLP Ch. 16 A Comparative Study on Feature Selection in Text Categorization Evaluating and Optimizing Autonomous Text Classification Systems Dumais, Platt, Heckerman, and Sahami. 1998. Inductive learning algorithms and representations for text categorization. CIKM 1998. Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001. Reuters dataset
Thu 25 Nov	Thanksgiving (no class)
Tue 30 Nov	CLASSIFICATION 2. Support vector machine classifiers. Kernel functions.	[powerpoint] [PDF]	CM	FSNLP Ch. 16 A Tutorial on Support Vector Machines for Pattern Recognition Conceptual Information Retrieval Using RUBRIC Dumais. Using SVMs for Text Categorization. IEEE Intelligent Systems 13(4) Jul-Aug 1998. Text Categorization Based on Regularized Linear Classification Methods	PE2 due
Thu 2 Dec	CLASSIFICATION 3. Text classification. Exploiting text-specific features. Feature selection. Evaluation of classification. Micro- and macro-averaging. Comparative results.	[powerpoint] [PDF]	CM	Machine Learning in Automated Text Categorization A Re-examination of Text Categorization Methods A Comparative Study on Feature Selection in Text Categorization
Fri 3 Dec	Review session for final	[doc] [PDF]	LE, DG
Thu 9 Dec	Final Exam 12:15pm-3:15pm

Back to the CS276A home page

CS 276A / LING 239I Text Retrieval and Mining Autumn 2004