CS 124 / LING 180 From Languages to Information, Dan Jurafsky, Winter 2016
Week 3: Group Exercises on Text Cat/NB/Sentiment Jan 26, 2016

  1. Part 1: Group Exercise

    We want to build a Naive Bayes sentiment classifier using add-1 smoothing, as described in lecture (regular multinomial Naive Bayes, not binary Naive Bayes). Here is our training corpus:

    Training Set:

        - just plain boring 
        - entirely predictable and lacks energy
        - no surprises and very few laughs 
        + very powerful 
        + the most fun film of the summer 

    Test Set:

        predictable with no originality 

Compute the prior for the two classes + and -, and the likelihoods for each word given the class (leave these in the form of fractions). Then compute whether the sentence in the test set is of class positive or negative (you may need a computer for this final computation). Make sure you know the correct Bayes equation to use to compute a value for each class in order to answer this question. What would the answer be without add-1 smoothing? (Optional: would using binary multinomial Naive Bayes change anything?)
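After working the fractions out by hand, you can check your arithmetic with a short script. This is only one possible sketch (the function and variable names are our own, not part of the exercise); it drops test words that never appear in the training vocabulary, as described in lecture:

```python
import math
from collections import Counter

# Training corpus from the exercise above.
train = [
    ("-", "just plain boring"),
    ("-", "entirely predictable and lacks energy"),
    ("-", "no surprises and very few laughs"),
    ("+", "very powerful"),
    ("+", "the most fun film of the summer"),
]

counts = {"+": Counter(), "-": Counter()}   # word counts per class
docs = Counter()                            # document counts per class
for label, doc in train:
    docs[label] += 1
    counts[label].update(doc.split())
vocab = set(counts["+"]) | set(counts["-"])

def log_posterior(doc, label):
    """log P(c) + sum of log P(w | c) with add-1 smoothing.
    Words that never occur in the training data are skipped."""
    total = sum(counts[label].values())
    score = math.log(docs[label] / sum(docs.values()))
    for w in doc.split():
        if w in vocab:
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
    return score

test = "predictable with no originality"
prediction = max(counts, key=lambda c: log_posterior(test, c))
```

Working in log space avoids underflow when many small likelihoods are multiplied; the class with the larger log posterior wins.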

Part 2: Challenge Problems

  1. Go to the Sentiment demo at http://text-processing.com/demo/sentiment/. Come up with 5 sentences that the classifier gets wrong. Can you figure out what is causing the errors?

  2. Binary multinomial NB seems to work better on some problems than full-count NB, while full-count works better on others. For what kinds of problems might binary NB be better, and why?
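As a reminder of the difference, binary multinomial NB clips each word's count at one per document before summing counts over a class. A quick sketch with two made-up documents:

```python
from collections import Counter

# Two toy documents (invented here purely for illustration).
docs = ["great great great plot", "great acting"]

# Full-count NB: every token occurrence contributes.
full = Counter(w for d in docs for w in d.split())

# Binary NB: each word counts at most once per document.
binary = Counter(w for d in docs for w in set(d.split()))
```

Here `full["great"]` is 4 but `binary["great"]` is only 2, one per document that contains it.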

A note on feature selection:
Consider this corpus (the first character indicates the class):

+ The chase sequences left me gasping for breath
+ This movie is awesome
+ Preston , being the amazing actor he is , plays saving grace
- Do NOT watch the movie . It is a waste of time
- Disgusting !
- Portman is reduced to mere eye candy - what a pity

Now look at the test sentence:

The movie is horrid

What label do you get? Next, are there features that misguide the classification? Is there some way you can leave those features out of consideration?
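One way to experiment is a classifier that takes a feature filter, so you can drop suspect words (the `keep` parameter and the stopword list below are our own hypothetical choices, not part of the exercise):

```python
import math
from collections import Counter

# Corpus from the feature-selection note above.
train = [
    ("+", "The chase sequences left me gasping for breath"),
    ("+", "This movie is awesome"),
    ("+", "Preston , being the amazing actor he is , plays saving grace"),
    ("-", "Do NOT watch the movie . It is a waste of time"),
    ("-", "Disgusting !"),
    ("-", "Portman is reduced to mere eye candy - what a pity"),
]

def classify(test_doc, keep=lambda w: True):
    """Add-1-smoothed multinomial NB over lowercased tokens.
    `keep` decides which tokens are used as features at all."""
    counts = {"+": Counter(), "-": Counter()}
    priors = Counter()
    for label, doc in train:
        priors[label] += 1
        counts[label].update(w for w in doc.lower().split() if keep(w))
    vocab = set(counts["+"]) | set(counts["-"])
    scores = {}
    for label in counts:
        total = sum(counts[label].values())
        s = math.log(priors[label] / sum(priors.values()))
        for w in test_doc.lower().split():
            if w in vocab and keep(w):
                s += math.log((counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = s
    return max(scores, key=scores.get)

# With all features, then with a hypothetical stopword list filtered out.
classify("The movie is horrid")
stopwords = {"the", "is", "a", "it", ",", ".", "!", "-"}
classify("The movie is horrid", keep=lambda w: w not in stopwords)
```

Comparing the two calls shows how much function words and punctuation can pull the decision; try other filters and see whether the label changes.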