Underlying Algorithms

The Viterbi Algorithm

An extension of the dynamic programming idea of backward induction, this is made possible by the Markovian property.

Notation can be a little confusing, the number of possible states will be S, (2, for the unfair casino, 3 for the weather: Sunny, cloudy, rainy, 8 for the CpG islands).

The transitions between states is an \(S\times S\) stochastic matrix, we then have observables, in the unfair casino, the dice roll, in the weather example the states of the seaweed: soggy, dry, damp, wet. There may the sounds that are heard. For these observables we suppose we have a confusion matrix which is the matrix of probabilities of seeing a certain observable, given a state, \(P(6 \mid F)=1/6\), \(P(wet\mid rainy)\), in the specific case of the nucleotides we know for sure that if we see an A, the underlying state can either be a A+ or A-, the other probabilities are 0. We saw that we denote the emission probabilities by indexing by the state: \(e_{s}(o)=P(\mbox{see o}\mid \mbox{state is } s)\).

We compute partial probabilities up to a point.

We call \(v_k(i)\) the most probable path ending in state k with observation \(i\) equal to \(x_i\), for all possible states k.

Calculation of \(v_Z(i)\), the most probable path to the state \(Z\) will have to path through A, B, or C at step \(i-1\). The most probable path to Z at step \(i\) will be either:

...sequence of states ending in A,Z ...sequence of states ending in B,Z ...sequence of states ending in C,Z

Viterbi Algorithm

We want the path ending in A,Z , B,Z or C,Z with an observation of X at step Z that has the highest (complete)probability. The path going through A has probability

\[P(\mbox{most probable path to A}) P(Z\mid A) P(\mbox{observation X}\mid Z)\]

The other two probabilities are the same with A replaced by B or C. We have to compute these three probabilities and take the maximum one:

\[P(\mbox{Z at step }i)=\max_{\ell=A,B,C} P(\ell \mbox{ at step }i-1)\times P(Z\mid \ell) P(\mbox{observation }X\mid Z)\]
\[v_Z(i)=e_Z(X_i)\max_k \{v_{k}(i-1)\times a_{kl}\}\]

Step-0: Initialize (i=0): \(v_0 (0) = 1, v_k(0) = 0\) for \(k>0\)

Steps- \(i=1..L\): \(v_l(i)=e_l(x_i ) max_k \{v_k(i-1) a_{kl} \}\)

* The Viterbi algorithm employs the strategy of Dynamic Programming * Probabilities should be expressed in a log space to avoid underflow errors *We need a backward pointer to the state \(k^*\) that gave the \(v_\ell(i)\) max, the backward pointer doesn’t need the emission probabilities, since for the same state \(\ell\) the emission probabilities are the same, so the pointer is

\[\phi_i(\ell)=arg max_j v_j(i-1)a_{jl}\]

Actual Example: Three states of the weather Sun, Cloudy, Rainy with transition matrix

\[\begin{split} A=\left( \begin{array}{rrrr} &Sunny & Cloudy & Rainy\\ S&0.5 & 0.25 & 0.25\\ C&0.375 & 0.125 & 0.375\\ R&0.125 & 0.675 & 0.375\\ \end{array} \right) \qquad \mbox{Stationary probabilities: } \begin{array}{ccc} Sunny & Cloudy & Rainy\\ 0.63 & 0.17 & 0.20\\ \end{array}\end{split}\]\[Three observables: sea weed is Soggy, Damp, Dryish, Brittle.\]
\[\begin{split}\mbox{Confusion /Emission matrix:}\qquad \begin{array}{rrrrr} & Brittle& Dryish &Damp &Soggy\\ Sunny&0.6 & 0.2 & 0.15 & 0.05\\ Cloudy&0.25 & 0.25 & 0.25 &0.25\\ Rainy& 0.05 & 0.10 & 0.35 & 0.50\\ \end{array}\end{split}\]

We saw Brittle, Damp, Soggy... What was the weather over those 3 days? Initializations with first observation Brittle:

\[\begin{split}\begin{aligned} P(Brittle\mid sunny)P(sunny)&=&0.6 \times 0.63=0.378\\ P(Brittle\mid cloudy)P(cloudy)&=&0.25 \times 0.17=0.0425\qquad \begin{array}{lr} sunny & 0.378\\ cloudy & 0.0425\\ rainy & 0.01 \end{array} \\ P(Brittle\mid rainy)P(rainy)&=&0.05 \times 0.20=0.01\end{aligned}\end{split}\]

Step one: State 1: sunny. \(P(sunny\mid sunny) \times P(Damp\mid sunny) \times 0.378=0.5 * 0.15 * 0.378= 0.02835\) \(P(sunny\mid cloudy) \times P(Damp\mid sunny) \times 0.0.0425=0.375 * 0.15 * 0.0425= 0.00239\) \(P(sunny\mid rainy) \times P(Damp\mid sunny) \times 0.0.001=0.125 * 0.15 *0.01= 0.0001875\)

\[\begin{split}\begin{aligned} v_2(sunny)&=& max(0.5 * 0.378,0.375 * 0.0425,0.125 * 0.01)*0.15=0.0284 \; \mbox{point from sunny(2) to sunny(1)}.\\ \mbox{State 2 }&:& {\mbox cloudy.}\\ v_2(cloudy)&=&max(0.378*0.25,0.0425*0.125,0.01*0.675)*0.25=0.024\; \mbox{cloudy(2) points to sunny(1)}\\ \mbox{State 3 }&:& {\mbox rainy.}\\ v_2(rainy)&=&max(0.378*0.25,0.0425*0.375,0.01*0.375)*0.35=0.0033\; \mbox{rainy(2) points to sunny(1)}\end{aligned}\end{split}\]

Step two:

\[\begin{split}\begin{aligned} \mbox{State 1 }&:& {\mbox sunny.}\\ v_3(sunny)&=&max(0.0284*0.5,0.024*0.25,0.033*0.125)*0.05=0.0142*0.05=0.00071 \\&& \mbox{ sunny(3) points to sunny(2)}\\ \mbox{State 2 }&:& {\mbox cloudy.}\\ v_3(cloudy)&=&max(0.0284*0.25,0.024*0.125,0.033*0.675)*0.25=0.022275*0.25=0.00556875\;\\&& \mbox{ cloudy(3) points to rainy(2)}\\ \mbox{State 3 }&:& {\mbox rainy.}\\ v_3(rainy)&=&max(0.0284*0.25,0.024*0.375,0.033*0.375)*0.50=0.012375*0.5=0.0061875 \\ && \mbox{rainy(3) points to rainy(2)}\end{aligned}\end{split}\]

Advantages of Viterbi: Computational complexity is reduced. Not deterred by noise in the middle, looks at the whole path before deciding which is the best.

Disadvantages: We don’t get to see any slightly suboptimal paths. We will remedy this, when we look at the Gibbs sampler.

Forward Algorithm

We have a sequence \(x\) for which we would like to compute the probability:

\[P(x)=\sum_\pi P(x,\pi)\]

In fact we can compute this with a similar algorithm where the maximum is replaced by sums:

\[f_k(i)=P(x_1,x_2,\ldots x_i, \pi_i=k)\qquad f_\ell(i+1)=e_\ell(x_{i+1})\sum_k f_k(i)a_{kl}\]

For a sequence of length L :\(P(x)=\sum_k f_k(L)\).

Backward Algorithm

Suppose we observe a sequence \(x\) and want to evaluate \(P(\pi_i=k\mid x)\).

\[P(x,\pi_i=k)=P(x_1,\ldots, x_i,\pi_i=k)P(x_{i+1}\ldots x_L\mid \pi_i=k)\]

The second term is called \(b_k(i)\) and the first is \(f_k(i)\), can be computed backwards from the ending sequence, in the same way the forward algorithm worked.

\[b_k(i)=\sum a_{kl}e_l(x_{i+1})b_l(i+1)\qquad P(\pi_i=k\mid x)= \frac{f_k(i)b_k(i)}{P(x)}\]

Estimation of the Hidden Markov Model

Parameters

If a state sequence is known, we have an observed matrix of counts of the number of times j followed i in the matrix \(A_{ij}\) and the number of counts that state s emitted observable \(l\) as a matrix \(E_{sl}\), maximum likelihood theory gives as estimates for a and e:

\[a_{kl}=\frac{A_{kl}}{\sum_{l'}A_{kl'}} \qquad e_{k}(b)=\frac{E_{k}(b)}{\sum_{b'}E_{k}(b)}\]

We may have to use pseudocounts (equivalent to a Dirichlet prior) if the numbers are small or zero.

Estimation of the Hidden Markov Model

Parameters

If a state sequence is known, we have an observed matrix of counts of the number of times j followed i in the matrix \(A_{ij}\) and the number of counts that state s emitted observable \(l\) as a matrix \(E_{sl}\), maximum likelihood theory gives as estimates for a and e:

\[a_{kl}=\frac{A_{kl}}{\sum_{l'}A_{kl'}} \qquad e_{k}(b)=\frac{E_{k}(b)}{\sum_{b'}E_{k}(b)}\]

We may have to use pseudocounts (equivalent to a Dirichlet prior) if the numbers are small or zero.

Baum-Welch , or EM specialized to HMM

\[E_{k}(b)=\sum_j \frac{1}{P(x^j)} \sum_{i\mid x_i^j=b} f_k^j(i)b^j_k(i)\]

We want to find the parameters (the model) that maximizes

\[log P(x\mid \theta)=log \sum_\pi P(x,\pi\mid \theta)\]

\(x\) means all the observations, including all the different sequences \(x^j, j=1...n\) We build a sequence of models \(\theta^t\), the missing data are the paths \(\pi\). The EM algorithm calculates \(A_{kl}\) and \(E_k(b)\) as the expected number of times each transition/emission occurs, given the training sequences. \(P(\hat{a}_{kl}\) used at position \(i\) in sequence \(x)\) \(=P(\pi_i=k,\pi_{i+1}=l|x,\theta)=\frac{f_k(i)a_{kl}e_l(x_{i+1})b_l(i+1)}{P(x)}\) thus the estimate by expectation(averaging) is:

\[A_{kl}=\sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)a_{kl}e_l(x^j_{i+1})b^j_l(i+1)\]

Initialize at arbitrary parameter values For each sequence \(j=1\ldots n\):

  • Calculate \(f_k(i)\) for sequence \(j\) using the forward algorithm.
  • Calculate \(b_k(i)\) for sequence \(j\) using the backward algorithm.
  • Add the contribution of sequence \(j\) to \(A\) and \(E\).

Calculate the new model parameters from the new A and E. Calculate the new log likelihood of the model. Stop when the incremental change is small, or that there are enough iterations.

Structure of HMMs

Prior knowledge should be used as optimally as possible.

Duration

* Length of extent of model: staying with the same nucleotide frequencies for instance:

  • Exponentially decaying length (geometric) \(P (\mbox{k residues}) = (1-p)p^{k-1}\)
  • Defined range of length, e.g., Model with distribution of lengths

between 2 and 8

  • Non-geometric length distribution, e.g., array of \(n\) states

* Silent (or, Null) states, e.g. Begin, End states Do not emit any symbols

Are used to simplify the models, reduce the number of transition probabilities to estimate from training data.

No loops consisting entirely of silent states are allowed. All HMM-related algorithms can be easily adjusted to incorporate silent states.