CS 124 / LING 180 From Languages to Information, Dan Jurafsky, Winter 2015
Week 4: Group Exercises on IR Feb 3, 2015

  1. Part 1: Group Exercise

    1. An IR system returns eight relevant documents and ten non-relevant documents. There are a total of twenty relevant documents in the collection. What is the precision of the system on this search, and what is its recall?







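One way to check an answer to this question is to compute the two ratios directly from the counts given; a minimal sketch:

```python
# Precision = relevant retrieved / total retrieved
# Recall    = relevant retrieved / total relevant in the collection
relevant_retrieved = 8
nonrelevant_retrieved = 10
total_relevant = 20

precision = relevant_retrieved / (relevant_retrieved + nonrelevant_retrieved)
recall = relevant_retrieved / total_relevant

print(precision)  # 8/18 ≈ 0.444
print(recall)     # 8/20 = 0.4
```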
    2. Draw the inverted index that would be built for the following document collection.

          Doc 1: new home sales top forecasts
          Doc 2: home sales rise in july
          Doc 3: increase in home sales in july
          Doc 4: july new home sales rise
          













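To check your drawing, a small sketch that builds the postings list (doc IDs in ascending order) for each term of the four documents:

```python
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    # dict.fromkeys keeps each term once, so a doc ID appears
    # at most once in a postings list
    for term in dict.fromkeys(text.split()):
        index[term].append(doc_id)

for term in sorted(index):
    print(term, "->", index[term])
```

Terms such as "home" and "sales" should end up with the full postings list [1, 2, 3, 4].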
    3. Compute cosines to find out whether Doc1 or Doc2 will be ranked higher for the two-word query "Linus pumpkin", given these counts for the (only) 3 documents in the corpus:

        term      Doc1   Doc2   Doc3
        -----------------------------
        Linus       10      0      1
        Snoopy       1      4      0
        pumpkin      4    100     10
    

    Do this by computing the tf-idf cosine between the query and Doc1 and between the query and Doc2, and choosing the higher value. You should use the ltc.lnn weighting variant (remember that's ddd.qqq), using the following table:
































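After working the numbers by hand, a sketch like the following can verify the ltc.lnn result (assuming base-10 logarithms and the standard SMART definitions: document weights are log tf × idf, cosine-normalized; query weights are log tf only, with no idf and no normalization):

```python
import math

counts = {
    "Linus":   {"Doc1": 10, "Doc2": 0,   "Doc3": 1},
    "Snoopy":  {"Doc1": 1,  "Doc2": 4,   "Doc3": 0},
    "pumpkin": {"Doc1": 4,  "Doc2": 100, "Doc3": 10},
}
N = 3  # number of documents in the corpus

def log_tf(tf):
    # "l": 1 + log10(tf) for tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

# "t": idf = log10(N / df)
idf = {t: math.log10(N / sum(1 for c in row.values() if c > 0))
       for t, row in counts.items()}

def doc_vector(doc):
    # "c": cosine-normalize the ltc document vector
    raw = {t: log_tf(counts[t][doc]) * idf[t] for t in counts}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

# query "Linus pumpkin" with lnn weighting: log tf only (tf = 1 each)
query = {"Linus": 1.0, "pumpkin": 1.0}

scores = {}
for doc in ("Doc1", "Doc2"):
    v = doc_vector(doc)
    scores[doc] = sum(qw * v[t] for t, qw in query.items())
    print(doc, round(scores[doc], 4))  # Doc1 ≈ 0.8944, Doc2 = 0.0
```

Note that "pumpkin" occurs in all three documents, so its idf is log10(3/3) = 0 and it contributes nothing to the document vectors.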
Part 2: Challenge Problems

  1. Do modern web search engines use stemming? If so, are all suffixes removed, or just some of them? How do search engines deal with Boolean operators like OR and AND? Do some experimenting with Google, Bing, DuckDuckGo, or your favorite search engine.
  2. Consider two documents A and B whose Euclidean distance is d and cosine similarity is c (using no normalization other than raw term frequencies). If we create a new document A' by appending A to itself and another document B' by appending B to itself, then:


    1. What is the Euclidean distance between A' and B' (using raw term frequency)?

    2. What is the cosine similarity between A' and B' (using raw term frequency)?

    3. What does this say about using cosine similarity as opposed to Euclidean distance in information retrieval?

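The effect of appending a document to itself can be checked numerically. The sketch below uses made-up raw term-frequency vectors for A and B; appending a document to itself simply doubles every term count:

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

A = [3, 0, 2]            # made-up raw term frequencies
B = [1, 2, 0]
A2 = [2 * x for x in A]  # A' = A appended to itself
B2 = [2 * x for x in B]  # B' = B appended to itself

print(euclid(A2, B2), 2 * euclid(A, B))  # Euclidean distance doubles
print(cosine(A2, B2), cosine(A, B))      # cosine similarity is unchanged
```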
  3. Is it important to remove stop words in a system that uses idf in its weighting scheme?
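A hint worth exploring for this last question: a term that appears in every document has idf = log(N/N) = 0, so its weight vanishes no matter how large its tf. A small demonstration with a made-up three-document collection:

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the house",
]
N = len(docs)

def idf(term):
    df = sum(1 for d in docs if term in d.split())
    return math.log10(N / df)

print(idf("the"))             # 0.0 — "the" appears in every document
print(round(idf("cat"), 3))   # log10(3/2) ≈ 0.176
print(round(idf("bird"), 3))  # log10(3/1) ≈ 0.477
```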