CS 124 / LING 180 From Languages to Information, Dan Jurafsky, Winter 2014
Week 4: Group Exercises on IR, Feb 4, 2014

Part 1: Group Exercise

    Compute cosines to find out whether Doc1 or Doc2 will be ranked higher for the two-word query "Linus pumpkin", given these counts for the (only) 3 documents in the corpus:

        term      Doc1    Doc2    Doc3
        ------------------------------
        Linus     10      0       1
        Snoopy    1       4       0
        pumpkin   4       100     10
    

    Do this by computing the tf-idf cosine between the query and Doc1 and the cosine between the query and Doc2, and choose the higher value. You should use the ltc.lnn weighting variation (remember that the notation is ddd.qqq: document weighting first, then query weighting), using the following table:

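The computation described above can be sketched in a few lines of Python, as a way to check a hand-worked answer. This is a minimal sketch under stated assumptions, not an official solution: it assumes base-10 logarithms, log tf defined as 1 + log10(tf) for tf > 0, and idf defined as log10(N/df). The names `counts`, `ltc_vector`, `lnn_vector`, and `score` are made up for this illustration.

```python
import math

# Raw term counts from the table above: term -> [Doc1, Doc2, Doc3].
counts = {
    "Linus":   [10, 0, 1],
    "Snoopy":  [1, 4, 0],
    "pumpkin": [4, 100, 10],
}
N_DOCS = 3

# The two-word query "Linus pumpkin": each term occurs once.
query = {"Linus": 1, "pumpkin": 1}


def log_tf(tf):
    """'l' component (assumed form): 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def idf(term):
    """'t' component (assumed form): log10(N / df)."""
    df = sum(1 for c in counts[term] if c > 0)
    return math.log10(N_DOCS / df)


def ltc_vector(doc_index):
    """Document weights: log tf times idf, then cosine (length) normalization."""
    raw = {t: log_tf(c[doc_index]) * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()}


def lnn_vector(term_counts):
    """Query weights: log tf only, with no idf and no normalization."""
    return {t: log_tf(tf) for t, tf in term_counts.items()}


def score(doc_index):
    """Dot product of the lnn query vector and the ltc document vector."""
    d = ltc_vector(doc_index)
    q = lnn_vector(query)
    return sum(q[t] * d.get(t, 0.0) for t in q)


score_doc1 = score(0)
score_doc2 = score(1)
print(f"Doc1: {score_doc1:.4f}")
print(f"Doc2: {score_doc2:.4f}")
```

Because the query side is lnn (unnormalized), the score is a dot product rather than a true cosine. The missing factor is the same for every document, so the ranking is unaffected.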

Part 2: Challenge Problems

  1. Consider two documents A and B whose Euclidean distance is d and whose cosine similarity is c (both computed on raw term-frequency vectors, with no other weighting or normalization). If we create a new document A' by appending A to itself and a new document B' by appending B to itself, then:
    1. What is the Euclidean distance between A' and B' (using raw term frequency)?

    2. What is the cosine similarity between A' and B' (using raw term frequency)?

    3. What does this say about using cosine similarity as opposed to Euclidean distance in information retrieval?

  2. Is it important to remove stop words in a system that uses idf in its weighting scheme?
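
Both challenge problems can be explored numerically before arguing them on paper. The sketch below is only an illustration under stated assumptions: the vectors `a` and `b` are example raw term-frequency vectors (borrowed from the Doc1 and Doc2 columns in Part 1), appending a document to itself is modeled as doubling every raw count, and idf is assumed to be log10(N/df). The function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def idf(df, n_docs):
    """Assumed idf form: log10(N / df)."""
    return math.log10(n_docs / df)


# Example raw term-frequency vectors (illustrative values only).
a = [10, 1, 4]
b = [0, 4, 100]

# Appending a document to itself doubles each raw term count.
a_doubled = [2 * x for x in a]
b_doubled = [2 * x for x in b]

dist = euclidean(a, b)
dist_doubled = euclidean(a_doubled, b_doubled)
cos_orig = cosine(a, b)
cos_doubled = cosine(a_doubled, b_doubled)

print(f"Euclidean: {dist:.4f} -> {dist_doubled:.4f}")
print(f"Cosine:    {cos_orig:.4f} -> {cos_doubled:.4f}")

# A term that occurs in every document (as a stop word typically would)
# has df equal to N under this idf definition.
print(f"idf when df == N: {idf(3, 3)}")
```

Comparing the printed values before and after doubling, and looking at the idf of a term that appears in every document, gives a concrete starting point for the written answers.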