HOMEWORK 1
CS 224U / LING 188/288, Spring 2014, Stanford
Question 1: Distance functions [2 points]
The following is a very small, invented word × document matrix (words A, B, C, D; documents d1 ... d5):
|   | d1 | d2 | d3 | d4 | d5 |
| A | 10 | 15 |  0 |  9 | 10 |
| B |  5 |  8 |  1 |  2 |  5 |
| C | 14 | 11 |  0 | 10 |  9 |
| D | 13 | 14 | 10 | 11 | 12 |
(A CSV version of the matrix is available for use with spreadsheet and matrix programs. Also, the
code distribution for questions 3 and 4 contains functions that will solve this problem and the next for you.)
Your tasks:
1. For each word A, B, C, and D, calculate and provide its Euclidean distance from word A, and then use those values to rank all of the words with respect to their closeness to A (closest = 1; farthest = 4). (For each word, you should provide its distance and rank.)
2. Normalize each row of the matrix by length (definition below), recalculate the Euclidean distances of all the words from word A, and recalculate the ranking with respect to A. (For each word, you should provide its distance and rank. You needn't provide the length-normalized matrix.)
3. If the ranking changed between step 1 and step 2, provide a brief explanation of the nature of that change, and try to articulate why it changed. If the ranking did not change, provide a brief explanation of why normalization did not have an effect here.
- Euclidean distance: the Euclidean distance between vectors \(x\) and \(y\) of dimension \(n\) is
  \[
  \sqrt{\sum_{i=1}^{n} (x_{i} - y_{i})^{2}}
  \]
- Length (L2) normalization: given a vector \(x\) of dimension \(n\), the length normalization of \(x\) is a vector \(\hat{x}\), also of dimension \(n\), obtained by dividing each element of \(x\) by \(\sqrt{\sum_{i=1}^{n} x_{i}^{2}}\).
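If you'd like to check your hand calculations, here is a minimal NumPy sketch (not part of the course code distribution; the array and helper names are just for illustration) that computes the raw and length-normalized distances from word A for the matrix above:

```python
import numpy as np

# Question 1 matrix (rows: A, B, C, D; columns: d1 ... d5).
M = np.array([
    [10, 15,  0,  9, 10],   # A
    [ 5,  8,  1,  2,  5],   # B
    [14, 11,  0, 10,  9],   # C
    [13, 14, 10, 11, 12],   # D
], dtype=float)
words = ['A', 'B', 'C', 'D']

def euclidean(x, y):
    """Euclidean distance between two vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

def length_normalize(m):
    """Divide each row of m by its L2 length."""
    lengths = np.sqrt(np.sum(m ** 2, axis=1, keepdims=True))
    return m / lengths

for label, matrix in [('raw', M), ('length-normalized', length_normalize(M))]:
    dists = [(w, euclidean(matrix[0], row)) for w, row in zip(words, matrix)]
    print(label, sorted(dists, key=lambda pair: pair[1]))
```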
Question 2: PMI [2 points]
Here is another invented word × document matrix (words A, B, C; documents d1, d2, d3):
|   | d1   | d2   | d3   |
| A | 1    | 0    | 0    |
| B | 1000 | 1000 | 4000 |
| C | 1000 | 2000 | 999  |
Calculate the pointwise mutual information (PMI) for cells (A, d1) and (B, d3), as defined in
equation 4 of Turney and Pantel (p. 157). Give these
values, and articulate, in 1-2 sentences, what is problematic about the values obtained.
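As an arithmetic check, here is a small standalone sketch, assuming the familiar definition from Turney and Pantel (the log of a cell's joint probability over the product of its row and column marginals), with all probabilities estimated directly from the counts:

```python
import numpy as np

# Question 2 matrix (rows: A, B, C; columns: d1, d2, d3).
M = np.array([
    [   1,    0,    0],   # A
    [1000, 1000, 4000],   # B
    [1000, 2000,  999],   # C
], dtype=float)

def pmi_cell(m, i, j):
    """PMI of cell (i, j): log of the joint probability over the product of the marginals."""
    total = m.sum()
    p_ij = m[i, j] / total
    p_i = m[i, :].sum() / total
    p_j = m[:, j].sum() / total
    return np.log(p_ij / (p_i * p_j))

print(pmi_cell(M, 0, 0))   # cell (A, d1)
print(pmi_cell(M, 1, 2))   # cell (B, d3)
```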
Question 3: Semantic neighbors [2 points]
This question and the next rely on a small code distribution. You can download it
for use on your own machine (it requires Python with NumPy and SciPy): cs224u-hw1.zip.
Alternatively, you can run it from your Stanford Unix account from this directory: /afs/ir/class/cs224u/hwcode/hw1/.
You don't have to know Python; you'll need it only to make some function calls.
To start exploring, navigate to the directory
containing the code and data distribution, enter the Python shell by
typing python at the prompt, and then
execute these commands, which load the code and the matrix:
- from vsm import *
- m = Matrix('imdb-wordword.csv')
The file imdb-wordword.csv contains a
(2998 × 2998) word × word matrix, where the count in
cell m[i, j] is the number of times that word i and word j occurred
together in the same text (a review from IMDB).
The function neighbors lets you see the
words that are closest to a given word according to a distance
measure:
- neighbors(m, 'scary', distfunc=cosine_distance, n=5)
- [('scary', 0.0), ('scare', 0.0033059029955925245), ('thing', 0.0040114240503390519), ('just', 0.0041900487203675452), ('cheesy', 0.0042465840121400644)]
- neighbors(m, 'scary', distfunc=euclidean_distance, n=5)
- [('scary', 0.0), ('seriously', 2734.9711150211442), ('basically', 2777.8320683583447), ('annoying', 2809.6780954408282), ('ridiculous', 2811.8454793960495)]
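For intuition about what these two measures are doing, here are standalone definitions of the sort presumably implemented in vsm.py (a sketch, not the actual course code): cosine distance is one minus the cosine of the angle between the vectors, while Euclidean distance operates on the raw magnitudes:

```python
import numpy as np

def cosine_distance(x, y):
    """1 minus cosine similarity; ignores overall vector length."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def euclidean_distance(x, y):
    """Straight-line distance; sensitive to overall vector length (raw frequency)."""
    return np.linalg.norm(x - y)
```

This difference in sensitivity to vector length helps explain why the two neighbor lists above look so different on raw counts.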
The code also has two basic reweighting functions: tfidf and pmi.
The pmi implementation has options for specifying Positive PMI and performing the discounting procedure that
Turney and Pantel describe. Here are example function calls (which might take a few
minutes to finish):
- p = pmi(m, positive=False, discounting=False)
- t = tfidf(m)
These new matrices behave like the original, so you can do things like this:
- neighbors(p, 'scary', distfunc=cosine_distance, n=5)
- [('scary', 2.2204460492503131e-16), ('horror', 0.44045320622678008), ('just', 0.52984219031468927), ('gore', 0.54364357758755311), ('creepy', 0.54628422820431433)]
- neighbors(t, 'scary', distfunc=euclidean_distance, n=5)
- [('scary', 0.0), ("it's", 6.2215154001902769e-05), ('blood', 6.6180613306240124e-05), ('suspense', 6.7280395622054464e-05), ('m', 6.9636551744935624e-05)]
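For a sense of what a reweighting step involves, here is a rough sketch of one common tf-idf variant applied to a whole count matrix (an illustration only; the course's tfidf and pmi functions may differ in details such as the tf-idf variant used and the handling of zeros):

```python
import numpy as np

def tfidf_reweight(m):
    """One common tf-idf variant for a term (row) x context (column) count matrix.
    Assumes every column has a nonzero sum and every row appears in at least one column."""
    tf = m / m.sum(axis=0, keepdims=True)      # term frequency, normalized per column
    df = (m > 0).sum(axis=1, keepdims=True)    # number of columns each row appears in
    idf = np.log(m.shape[1] / df)              # inverse document frequency
    return tf * idf
```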
Finally, the code implements Latent Semantic Analysis (truncated
singular value decomposition). Here's how to use
it; k gives the number of column
dimensions to use:
- sem = lsa(m, k=100)
- neighbors(sem, 'scary', distfunc=cosine_distance, n=5)
- [('scary', 0.0), ('boring', 0.001122878286416662), ('scare', 0.0011598457991651712), ('cheesy', 0.0012451811570568516), ('lame', 0.0015101458334525475)]
LSA can in principle be combined with (preceded by) either of the
reweighting schemes discussed above, and it is in principle
compatible with either distance metric on vectors.
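For reference, LSA in this sense is just a truncated SVD of the (possibly reweighted) matrix; a minimal NumPy illustration (not the vsm.py implementation) is:

```python
import numpy as np

def lsa_reduce(m, k=100):
    """Reduce the columns of m to k dimensions via truncated singular value decomposition.
    Rows of the result are the new k-dimensional representations of the original rows."""
    U, s, Vt = np.linalg.svd(m, full_matrices=False)
    return U[:, :k] * s[:k]
```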
Your task: Sample around for a little while,
seeing what neighbors turns up for different parts of the lexicon,
and then venture a proposal about how best to build a VSM from
this data:
- Which words did you use to explore? (List them.)
- Which weighting scheme should we use, if any? (If you choose PMI, be sure to say which options you favor.)
- Should we run LSA? If so, what value of k seems best?
- Which distance metric should we use?
The idea is that these should all work together to deliver the
final VSM that we use for semantics. You need only give
answers. We're not requesting justification.
Question 4: Semantic orientation [4 points]
The semantic orientation method
of Turney and
Littman (2003) is offered as a general technique for building lexicons
for any desired semantic dimension using just similarity relations on
vector representations. Here are the steps:
- Define two seed-sets \(S_{1}\) and \(S_{2}\) of words (they should be opposing in some way that is appropriate for your matrix).
- Pick a vector distance measure \(dist\).
- For the word \(w\) of interest, use \(dist\) to get the sum of all distances between \(w\) and the words in \(S_{1}\). Do the same for \(S_{2}\).
- The score is the sum for \(S_{1}\) minus the sum for \(S_{2}\).
- Semantic orientation summarized: for a given distance metric \(dist\), word \(w\), and seed sets \(S_{1}\) and \(S_{2}\), the score is
\[
\Big(\sum_{w' \in S_{1}} dist(w, w')\Big) - \Big(\sum_{w' \in S_{2}} dist(w, w')\Big)
\]
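The scoring step itself is short; the following standalone sketch (an illustration, not the vsm.py implementation) computes the score for a single word vector, given precomputed vectors for the seed words and a distance function such as the cosine sketch above:

```python
def semantic_orientation_score(vec, seed_vecs1, seed_vecs2, distfunc):
    """Sum of distances to the first seed set minus sum of distances to the second.
    Words close to seed set 1 get negative scores; words close to seed set 2 get
    positive scores (matching the sign convention of the example output below)."""
    return (sum(distfunc(vec, s) for s in seed_vecs1)
            - sum(distfunc(vec, s) for s in seed_vecs2))
```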
The code vsm.py implements this
as semantic_orientation, which
ranks the entire vocabulary according to the two seed sets. Here is a call
with the default arguments, which use Turney and Littman's original
seed-sets:
- m = Matrix('imdb-wordword.csv')
- so = semantic_orientation(m,
seeds1=['bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior'],
seeds2=['good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior'],
distfunc=cosine_distance)
- so[ : 5] # words most associated with seeds1
- [('1/10', -0.01734046484573426), ('suck', -0.015119973084056548), ('gonna', -0.014471191150799867), ('crap', -0.014289001238634969), ('renting', -0.013775117639892809)]
- so[-5: ] # words most associated with seeds2
- [('breathtaking', 0.015729692669502304), ('titanic', 0.015771864952280112), ('victoria', 0.016104061915204637), ('powell', 0.019246824409916319), ('columbo', 0.019610017731074403)]
Your task:
1. Make sure you have a VSM set up in the way you advocated for in question 3.
2. Pick one semantic opposition (distinct from pos/neg sentiment) that works well in this VSM, built from this data. Name the opposition, give your seed sets, and paste in the top 5 and bottom 5 values, as above.
3. As in step 2, but now with a semantic opposition that seems not to work at all with this VSM and method.