### CS 124 / LING 180 From Languages to Information, Dan Jurafsky, Winter 2017 Week 3: Group Exercises on Language Modeling, Tuesday Jan 24, 2017

Part 1: Group Exercise

We are interested in building a language model over a language with three words: A, B, C. Our training corpus is

```
AAABACBABBBCCACBCC
```

1. First, train a unigram language model using maximum likelihood estimation. What are the probabilities? (Leave your answers in the form of a fraction.) Reminder: We don't need start or end tokens for training a unigram model, since the context of each word doesn't matter. So, we will not add any special tokens to our corpus for this part of the problem.
P(A) =

P(B) =

P(C) =
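
To check a hand computation, the relative-frequency (MLE) estimate P(w) = count(w) / N can be sketched in a few lines. This is only an illustrative sketch: the function name `unigram_mle` is hypothetical, and it assumes each character of the training string is one word token. Note that `Fraction` prints results in lowest terms.

```python
from collections import Counter
from fractions import Fraction


def unigram_mle(corpus):
    """Relative-frequency estimate: P(w) = count(w) / N (no start/end tokens)."""
    counts = Counter(corpus)
    total = len(corpus)
    return {w: Fraction(c, total) for w, c in counts.items()}


train = "AAABACBABBBCCACBCC"  # assumption: one character = one word token
unigram_probs = unigram_mle(train)

for word in sorted(unigram_probs):
    print(f"P({word}) = {unigram_probs[word]}")
```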

2. Next, train a bigram language model using maximum likelihood estimation. For this problem, we will add an end token, \$\$, at the end of the string, so that we can model the probability of the sentence ending after a particular word. If you choose to add a start token as well, that's fine too, but these solutions assume no start token. Fill in the probabilities below. Leave your answers in the form of a fraction.
P(A|A) =

P(A|B) =

P(A|C) =

P(B|A) =

P(B|B) =

P(B|C) =

P(C|A) =

P(C|B) =

P(C|C) =
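
The bigram estimate P(w | prev) = count(prev, w) / count(prev) can be checked the same way. The sketch below follows the setup stated above (an end token, no start token); the name `bigram_mle` and the single-character `"$"` used as a stand-in for the end token are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction

END = "$"  # hypothetical stand-in for the handout's end token


def bigram_mle(corpus, end=END):
    """MLE bigram estimate: P(w | prev) = count(prev, w) / count(prev)."""
    tokens = list(corpus) + [end]             # end token appended, no start token
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])     # every token that is followed by something
    return {
        (prev, w): Fraction(c, context_counts[prev])
        for (prev, w), c in bigram_counts.items()
    }


train = "AAABACBABBBCCACBCC"  # assumption: one character = one word token
bigram_probs = bigram_mle(train)

for (prev, w), p in sorted(bigram_probs.items()):
    print(f"P({w}|{prev}) = {p}")
```

Bigrams that never occur in training are simply absent from the dictionary, which corresponds to an MLE probability of zero.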

3. Now evaluate your language models on the corpus:
```
ABACABB
```
What is the perplexity of the unigram language model evaluated on this corpus? Since we didn't add any special start/end tokens when we were training our unigram language model, we won't add any when we evaluate the perplexity of the unigram language model, either, so that we're consistent.

What is the perplexity of the bigram language model evaluated on this corpus? Since we added an end token when we were training our bigram model, we'll add an end token to this corpus again before we evaluate perplexity.
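
Perplexity is the inverse probability of the test corpus normalized by the number of scored tokens, PP = (p_1 · p_2 · … · p_N)^(-1/N). The sketch below computes it in log space for both models. Several choices here are assumptions rather than part of the exercise: the helper name `perplexity`, the `"$"` stand-in for the end token, treating a zero probability as infinite perplexity, and (since there is no start token) scoring only the transitions of the bigram model, so the first word is not scored.

```python
import math
from collections import Counter

END = "$"  # hypothetical stand-in for the handout's end token


def perplexity(probabilities):
    """PP = (p_1 * ... * p_N) ** (-1/N); infinite if any probability is zero."""
    if any(p == 0 for p in probabilities):
        return math.inf
    n = len(probabilities)
    return math.exp(-sum(math.log(p) for p in probabilities) / n)


train = "AAABACBABBBCCACBCC"
test = "ABACABB"

# Unigram model: no special tokens in training or evaluation.
unigram_counts = Counter(train)
unigram_test_probs = [unigram_counts[w] / len(train) for w in test]

# Bigram model: end token appended to both corpora, no start token.
train_tokens = list(train) + [END]
test_tokens = list(test) + [END]
bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))
context_counts = Counter(train_tokens[:-1])
bigram_test_probs = [
    bigram_counts[(prev, w)] / context_counts[prev]
    for prev, w in zip(test_tokens, test_tokens[1:])
]

unigram_pp = perplexity(unigram_test_probs)
bigram_pp = perplexity(bigram_test_probs)
print("unigram perplexity:", unigram_pp)
print("bigram perplexity:", bigram_pp)
```

A single unseen bigram in the test corpus is enough to drive the unsmoothed probability of the whole corpus to zero.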

4. Now repeat everything above for add-1 smoothing.
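
Add-1 (Laplace) smoothing adds one to every count and adds the number of possible outcomes to the denominator. A minimal sketch follows, under these assumptions: the function names are hypothetical, the vocabulary is fixed to the three words A, B, C, `"$"` stands in for the end token, and the end token is counted as a possible outcome for the bigram model (so its denominator grows by 4 rather than 3).

```python
from collections import Counter
from fractions import Fraction

END = "$"  # hypothetical stand-in for the handout's end token


def add_one_unigram(corpus, vocab):
    """P(w) = (count(w) + 1) / (N + |V|)."""
    counts = Counter(corpus)
    return {w: Fraction(counts[w] + 1, len(corpus) + len(vocab)) for w in vocab}


def add_one_bigram(corpus, vocab, end=END):
    """P(w | prev) = (count(prev, w) + 1) / (count(prev) + number of outcomes)."""
    tokens = list(corpus) + [end]
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    outcomes = list(vocab) + [end]  # assumption: the end token is a possible next symbol
    return {
        (prev, w): Fraction(bigram_counts[(prev, w)] + 1,
                            context_counts[prev] + len(outcomes))
        for prev in vocab
        for w in outcomes
    }


train = "AAABACBABBBCCACBCC"
vocab = ["A", "B", "C"]

smoothed_unigram = add_one_unigram(train, vocab)
smoothed_bigram = add_one_bigram(train, vocab)

for (prev, w), p in sorted(smoothed_bigram.items()):
    print(f"P({w}|{prev}) = {p}")
```

Every bigram now has a nonzero probability, so the perplexity on the test corpus becomes finite.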

Part 2: Challenge Problems
1. What is the difference between using an UNK token (for unknown words) and smoothing? In what situations would you use one versus the other?

2. Suppose you build an interpolated trigram language model with three weights: lambda1 for unigrams, lambda2 for bigrams, and lambda3 for trigrams. Normally we set these lambdas on a held-out set. Suppose instead we set them on the training data. This will cause the lambdas to take on very unusual values. What will these lambdas look like? Why?
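
For reference, simple linear interpolation in its usual form is written out below, with $\lambda_1, \lambda_2, \lambda_3$ standing for lambda1, lambda2, lambda3. This assumes context-independent weights that sum to one; the exercise does not state which variant is intended.

$$
\hat{P}(w_n \mid w_{n-2}\, w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2}\, w_{n-1}), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
$$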

3. Show that if we estimate two bigram language models using unsmoothed relative frequencies (MLE), one from a text corpus and the second from the same corpus in reverse order, the models will assign the same probability to new sentences (when applied in forward and backward order respectively). Hint: First write out the entire equation for sentence probabilities in terms of counts.
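
Before attempting the proof, it can help to check the claim numerically. The sketch below is an empirical sanity check, not a proof. It rests on assumptions that are not part of the exercise: each sentence is padded with both a start and an end symbol (`<s>` and `</s>`), the training string is split into two hypothetical sentences, and the helper names are made up for illustration.

```python
from collections import Counter
from fractions import Fraction

BOS, EOS = "<s>", "</s>"  # assumed boundary symbols


def train_bigram(sentences):
    """Collect bigram and context counts with boundary symbols on both sides."""
    bigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        tokens = [BOS] + list(sentence) + [EOS]
        bigrams.update(zip(tokens, tokens[1:]))
        contexts.update(tokens[:-1])
    return bigrams, contexts


def sentence_prob(model, sentence):
    """Unsmoothed (MLE) bigram probability of one sentence, as an exact fraction."""
    bigrams, contexts = model
    tokens = [BOS] + list(sentence) + [EOS]
    prob = Fraction(1)
    for prev, w in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:  # unseen context: probability is zero
            return Fraction(0)
        prob *= Fraction(bigrams[(prev, w)], contexts[prev])
    return prob


# Hypothetical split of the training string into two sentences.
corpus = ["AAABACBAB", "BBCCACBCC"]
forward = train_bigram(corpus)
backward = train_bigram([s[::-1] for s in corpus])

for sent in ["ABAC", "CCB", "BAB", "CA"]:
    p_fwd = sentence_prob(forward, sent)
    p_bwd = sentence_prob(backward, sent[::-1])
    print(sent, p_fwd, p_bwd, p_fwd == p_bwd)
```

Using exact fractions rather than floats makes the equality test meaningful; the proof itself comes from comparing the numerators and denominators of the two products of counts.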