CS276A / LING 239I
Text Retrieval and Mining
Autumn 2004

Assignments FAQ and Clarifications

#1f. You can assume a bigram language model. Your answer should reflect a general approach, not something appropriate only for this specific corpus.
#3c. This is basically a math question. Rho is defined as the normalized vector v (||v|| = 1) that maximizes the expression in brackets. So the question is asking how you actually use that definition to compute rho. Your answer should be an expression of rho in terms of the given parameters and vectors, without any sort of argmax operator. The derivation is only a few steps. More hints: you shouldn't try to take the derivative. Instead, manipulate the expression to get it in the form of v dot something constant. Then you can find rho because given a vector b, the length-normalized vector a that maximizes a dot b is just b/|b|.
#4. Assume no interpolation.

A note on the getPrecision method in the starter code: the way I calculate it might be a bit different from what you'd expect. Specifically, if for a particular query q there are only k relevant documents, where k < 10, I only consider the top k hits. The rationale for this is that if there are only, say, 5 relevant documents and you calculate the top 10 precision on your search results, you can't possibly do better than 5 out of 10. So it makes more sense to calculate the top 5 precision for this query. Anyway, the upshot is that you need to make sure you're calculating your overall average precision correctly given this slight modification -- e.g. you can't assume that the denominator is 225*10.
The original query.text file had three random question marks in it that will cause errors when you try to parse it. Copy the new version from the assignment directory (or remove them yourself).
You must use your subclass of Similarity in your IndexSearcher in addition to your IndexWriter. Call IndexSearcher.setSimilarity in your runqueries code.

#1. Some people consider the unary representation of a number x to be (x-1) ones followed by a zero, or (x-1) zeroes followed by a one, or 317 zeroes followed by log x ones, or...? In any case, just stick to what we use in class, which is x ones followed by a zero.
#2b. Typo! This should say, "Describe each set of values for which..."
#2c,d. Note that part c asks where the document satisfies the NEAR/x clause, which means you must enumerate all pairs of positions that satisfy the clause; whereas part d asks whether the document satisfies the clause, which means you merely need to determine if any such pairs exist. This distinction makes a difference: the algorithm for part d can operate somewhat differently from the one in part c. For part c, be sure to consider multiple cases (L > x vs. L <= x). For part d, only one of the three options is true.
#4c. If you find yourself facing a summation similar to the one in lecture 3 and don't know how to solve it by hand, don't worry... I'm not sure it's even possible. Just write a few lines of code.
#5. Pointers can be of non-standard size (they don't have to be 32 bits; they don't even have to be byte-aligned or word-aligned).
#6. conflate: to bring together, combine. When a stemmer conflates two terms, it maps them to the same stemmed form.

Back to the CS276A homepage
Last modified: October 7, 2004