HOMEWORK 3
Question 1
The following is a very small, invented word × document matrix (words A, B, C, D; documents d1 ... d5):
  | d1 | d2 | d3 | d4 | d5
A | 10 | 15 |  0 |  9 | 10
B |  5 |  8 |  1 |  2 |  5
C | 14 | 11 |  0 | 10 |  9
D | 13 | 14 | 10 | 11 | 12
(A CSV version of the matrix for use with spreadsheet and matrix programs.)
Your tasks:
- (2.5 points)
For each word A, B, C, and D, calculate and provide its Euclidean
distance from word A, and then use those values to rank all of the
words with respect to their closeness to A (closest = 1; farthest =
4). (For each word, you should provide its distance and rank.)
- (2.5 points)
Normalize each row of the matrix by length (definition below),
recalculate the Euclidean distances of all the words from word A,
and recalculate the ranking with respect to A. (For each word, you
should provide its distance and rank. You needn't provide the
length-normalized matrix.)
- (2 points)
If the ranking changed between the first and second tasks, briefly
describe the nature of that change and try to articulate why it
occurred. If the ranking did not change, briefly explain why
normalization had no effect here.
- Euclidean distance
- The Euclidean distance between vectors \(x\) and \(y\) of dimension \(n\) is
\(
\sqrt{\sum_{i=1}^{n} |x_{i} - y_{i}|^{2}}
\)
- Length (L2) normalization
Given a vector \(x\) of dimension \(n\), its length normalization is
the vector \(\hat{x}\), also of dimension \(n\), obtained by
dividing each element of \(x\) by \(\sqrt{\sum_{i=1}^{n} x_{i}^{2}}\).
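As a sanity check on the arithmetic, the two computations above can be sketched in plain Python. This is only a sketch of the definitions (the variable names are ours), not the required worked answers; you should still report the distances and ranks yourself.

```python
import math

# Rows of the word-by-document matrix, copied from the table above.
rows = {
    "A": [10, 15, 0, 9, 10],
    "B": [5, 8, 1, 2, 5],
    "C": [14, 11, 0, 10, 9],
    "D": [13, 14, 10, 11, 12],
}

def euclidean(x, y):
    """Euclidean distance: sqrt of the sum of squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l2_normalize(x):
    """Divide each element of x by the vector's L2 length."""
    length = math.sqrt(sum(xi ** 2 for xi in x))
    return [xi / length for xi in x]

# Task 1: raw distances from A, ranked (closest = 1).
raw = {w: euclidean(rows["A"], v) for w, v in rows.items()}
rank_raw = {w: r for r, w in enumerate(sorted(raw, key=raw.get), start=1)}

# Task 2: the same, after length-normalizing every row.
norm_rows = {w: l2_normalize(v) for w, v in rows.items()}
norm = {w: euclidean(norm_rows["A"], v) for w, v in norm_rows.items()}
rank_norm = {w: r for r, w in enumerate(sorted(norm, key=norm.get), start=1)}

print(raw, rank_raw)
print(norm, rank_norm)
```

Comparing `rank_raw` with `rank_norm` is then a direct way to see whether normalization reordered the words.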
Question 2
(3 points) Here is another invented word × document matrix (words A, B, C; documents d1, d2, d3):
  |   d1 |   d2 |   d3
A |    1 |    0 |    0
B | 1000 | 1000 | 4000
C | 1000 | 2000 |  999
Calculate the pointwise mutual information (PMI) for cells (A, d1) and
(B, d3), as defined in equation 4 of Turney and Pantel (p. 157).
What is problematic about the values obtained? How might we address
the problem, so that the PMI values are more intuitive?
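The computation can be sketched as follows, assuming the usual maximum-likelihood estimate of PMI, \(\mathrm{pmi}(i,j) = \log \frac{p(i,j)}{p(i)\,p(j)}\), with the probabilities estimated directly from the raw counts (check this against equation 4 in the paper before relying on it):

```python
import math

# Word-by-document counts, copied from the table above.
matrix = {
    "A": [1, 0, 0],
    "B": [1000, 1000, 4000],
    "C": [1000, 2000, 999],
}

# Grand total of all counts, used to turn counts into probabilities.
total = sum(sum(row) for row in matrix.values())

def pmi(word, doc_index):
    """PMI of a cell: log of joint probability over product of marginals."""
    p_ij = matrix[word][doc_index] / total            # joint p(word, doc)
    p_i = sum(matrix[word]) / total                   # row marginal p(word)
    p_j = sum(row[doc_index] for row in matrix.values()) / total  # column marginal p(doc)
    return math.log(p_ij / (p_i * p_j))

print(pmi("A", 0))  # cell (A, d1)
print(pmi("B", 2))  # cell (B, d3)
```

Comparing the two values, and the raw counts behind them (1 versus 4000), should make the problematic behavior of PMI on rare events apparent.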