 |
CS276A / LING 239I
Text Retrieval and Mining
Autumn 2004 |
Assignments FAQ and Clarifications
Problem set 2:
- #1f. You can assume a bigram language model. Your answer should
reflect a general approach, not something appropriate only for this
specific corpus.
- #3c. This is basically a math question. Rho is defined as the
normalized vector v (||v|| = 1) that maximizes the expression in brackets. So
the question is asking how you actually use that definition to compute
rho. Your answer should be an expression of rho in terms of the given
parameters and vectors, without any sort of argmax operator. The
derivation is only a few steps. More hints: you shouldn't try to take the
derivative. Instead, manipulate the expression to get it in the form of v
dot something constant. Then you can find rho because given a vector b,
the length-normalized vector a that maximizes a dot b is just b/|b|.
- #4. Assume no interpolation.
Practical exercise 1:
- A note on the getPrecision method in the starter code: the way I
calculate it might be a bit different from what you'd
expect. Specifically, if for a particular query q there are only k
relevant documents, where k < 10, I only consider the top k hits. The
rationale for this is that if there are only, say, 5 relevant documents
and you calculate the top 10 precision on your search results, you can't
possibly do better than 5 out of 10. So it makes more sense to calculate
the top 5 precision for this query. Anyway, the upshot is that you need to make
sure you're calculating your overall average precision correctly given
this slight modification -- e.g. you can't assume that the denominator is 225*10.
- The original query.text file had three random question marks in it
that will cause errors when you try to parse it. Copy the new version from
the assignment directory (or remove them yourself).
- You must use your subclass of Similarity in your
IndexSearcher in addition to your IndexWriter. Call
IndexSearcher.setSimilarity in your runqueries code.
Problem set 1:
- #1. Some people consider the unary representation of a number x to be
(x-1) ones followed by a zero, or (x-1) zeroes followed by a one, or 317
zeroes followed by log x ones, or...? In any case, just stick to what we use in class,
which is x ones followed by a zero.
- #2b. Typo! This should say, "Describe each set of values for which..."
- #2c,d. Note that part c asks where the document satisfies the
NEAR/x clause, which means you must enumerate all pairs of positions that
satisfy the clause; whereas part d asks whether the document
satisfies the clause, which means you merely need to determine if any such
pairs exist. This distinction makes a difference: the algorithm for part d
can operate somewhat differently from the one in part c. For part c, be
sure to consider multiple cases (L > x vs. L <= x). For part d, only one
of the three options is true.
- #4c. If you find yourself facing a summation similar to the one in
lecture 3 and don't know how to solve it by hand, don't worry... I'm not
sure it's even possible. Just write a few lines of code.
- #5. Pointers can be of non-standard size (they don't have to be 32
bits; they don't even have to be byte-aligned or word-aligned).
- #6. conflate: to bring together, combine. When a stemmer conflates
two terms, it maps them to the same stemmed form.
Back to the CS276A homepage
Last modified: October 7, 2004