Hidden Markov Model for CpG islands¶
Does a Short DNA Stretch Come from a CpG Island?¶
Here we use likelihood scores as a discriminatory statistic.
We need to model both the CpG Islands and the regions that are not CpG Islands. For this we need training sequences. We can collect databases of nucleotides from both types of sequences.
* Collect a database of 60,000 nucleotides
* Extract 48 putative CpG Islands
* For the putative CpG regions, compute the transition probabilities from nucleotide \(s\) to nucleotide \(t\) using a Markovian model (a toy R sketch of this count-based estimation follows this list)
\(c^+_{s,t}\) is the number of times that \(s\) is followed by \(t\) in the database of CpG Islands; the estimated transition probability is then the row-normalized count, \(a^+_{s,t} = c^+_{s,t} / \sum_{t'} c^+_{s,t'}\).
* Similarly for the regions without CpG Islands
* Construct table of transition probabilities
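As an illustration of this count-based estimation, here is a minimal R sketch; the short sequence used is made-up toy data, not the actual training database of islands.

# Toy illustration: estimate transition probabilities from dinucleotide counts.
x <- unlist(strsplit("ACGCGCTTACGCGG", ""))   # made-up example sequence
nuc <- c("A", "C", "G", "T")
counts <- matrix(0, 4, 4, dimnames = list(nuc, nuc))
for (i in 1:(length(x) - 1)) {
  counts[x[i], x[i + 1]] <- counts[x[i], x[i + 1]] + 1
}
# a_{st} = c_{st} / sum_t' c_{st'} : row-normalize the counts
aplus <- counts / rowSums(counts)
round(aplus, 3)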
Table of Transition Probabilities for CpG Islands
Model + | A | C | G | T |
---|---|---|---|---|
A | .180 | .274 | .426 | .120 |
C | .171 | .368 | .274 | .188 |
G | .161 | .339 | .375 | .125 |
T | .079 | .355 | .384 | .182 |
stationary | 0.155 | 0.341 | 0.350 | 0.154 |
Table of Transition Probabilities for non CpG Islands
Model - | A | C | G | T |
---|---|---|---|---|
A | .300 | .205 | .285 | .210 |
C | .322 | .298 | .078 | .302 |
G | .248 | .246 | .298 | .208 |
T | .177 | .239 | .292 | .292 |
stationary | 0.262 | 0.246 | 0.239 | 0.253 |
Calculate the Log-Odds ratio for a sequence \(x\):

\(S(x) = \log_2 \frac{P(x \mid \mbox{model } +)}{P(x \mid \mbox{model } -)} = \sum_{i=2}^{L} \log_2 \frac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}}\)
Scores \(S(x)\) allow discrimination of a model (+) against another (-)
We can summarize the computations by writing once and for all the \(\log_2 \beta\) matrix, where \(\beta_{st} = a^+_{st}/a^-_{st}\), and adding up the relevant terms from this matrix. For instance, from the two transition matrices above we can calculate the table once, rather than recompute each \(\beta\) every time:
> statp=c(0.155,0.341,0.350,0.154)
> statm=c(0.262,0.246,0.239,0.253)
> statp/statm
[1] 0.5916031 1.3861789 1.4644351 0.6086957
> log(statp/statm,2)
[1] -0.7572986 0.4711134 0.5503443 -0.7162070
Score Computation¶
Thus the sequence \(x= CACTAAGCTA\), with transitions \(CA, AC, CT, TA, AA, AG, GC, CT, TA\), would have a score (using the \(\log_2\) ratio table computed below): \(-0.913+0.419-0.684-1.164-0.737+0.580+0.463-0.684-1.164=-3.884\), normalised by length: \(-3.884/9 \approx -0.432\) (bits). To conclude whether a sequence is from a CpG island or not, we just compute the score for the sequence; here it is negative, indicating that this is not a CpG island. The histogram for known sequences shows a cutoff at around -0.05 and 0.1, with an uncertain region in between.
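The same computation can be done mechanically in R; a minimal sketch, rebuilding the \(\log_2\) ratio matrix from the two transition tables above (the helper name score is ours, not part of the notes):

nuc <- c("A", "C", "G", "T")
Aplus  <- matrix(c(.180,.274,.426,.120,
                   .171,.368,.274,.188,
                   .161,.339,.375,.125,
                   .079,.355,.384,.182), 4, 4, byrow = TRUE, dimnames = list(nuc, nuc))
Aminus <- matrix(c(.300,.205,.285,.210,
                   .322,.298,.078,.302,
                   .248,.246,.298,.208,
                   .177,.239,.292,.292), 4, 4, byrow = TRUE, dimnames = list(nuc, nuc))
beta <- log2(Aplus / Aminus)            # log2 ratio matrix used for scoring
score <- function(x) {                  # x: character vector of nucleotides
  i <- seq_len(length(x) - 1)
  terms <- beta[cbind(x[i], x[i + 1])]  # one log-odds term per transition
  sum(terms) / length(terms)            # length-normalised score in bits
}
score(unlist(strsplit("CACTAAGCTA", "")))   # about -0.43: negative, so not a CpG island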
This is a first instance of what is known as discriminant analysis in statistics: given some training data, we build a model and a discriminating statistic (the score here) that allows us to guess which group (+ or -) an anonymous observation belongs to.
More about finding the stationary distribution¶
Aplus=scan()
.180 .274 .426 .120
.171 .368 .274 .188
.161 .339 .375 .125
.079 .355 .384 .182
Aplus=matrix(Aplus,4,4,byrow=T)
> eigen(t(Aplus))
$values
[1] 1.00034120+0.00000000i 0.11202252+0.00000000i -0.00368186+0.05741808i
[4] -0.00368186-0.05741808i
$vectors
[,1] [,2] [,3] [,4]
[1,] 0.3868925+0i 0.5198328+0i 0.5541003-0.5988781i 0.5541003+0.5988781i
[2,] 0.8536965+0i -0.5304369+0i 0.6945813+0.2868043i 0.6945813-0.2868043i
[3,] 0.8749564+0i 0.6644252+0i -1.0666597+0.8310571i -1.0666597-0.8310571i
[4,] 0.3865033+0i -0.6532238+0i -0.1826954-0.5193075i -0.1826954+0.5193075i
v1 <- eigen(t(Aplus))$vectors
v1/sum(v1[,1])
[,1] [,2] [,3] [,4]
[1,] 0.1546303+0i 0.2077628+0i 0.22145866-0.2393551i 0.22145866+0.2393551i
[2,] 0.3411990+0i -0.2120010+0i 0.27760504+0.1146278i 0.27760504-0.1146278i
[3,] 0.3496960+0i 0.2655525+0i -0.42631454+0.3321506i -0.42631454-0.3321506i
[4,] 0.1544747+0i -0.2610756+0i -0.07301833-0.2075529i -0.07301833+0.2075529i
A4=Aplus%*%Aplus%*%Aplus%*%Aplus
> A4%*%A4
[,1] [,2] [,3] [,4]
[1,] 0.1549889 0.3419902 0.3505069 0.1548329
[2,] 0.1551591 0.3423660 0.3508920 0.1550031
[3,] 0.1549995 0.3420136 0.3505309 0.1548435
[4,] 0.1550031 0.3420216 0.3505391 0.1548472
Aminus <- scan()
.300 .205 .285 .210
.322 .298 .078 .302
.248 .246 .298 .208
.177 .239 .292 .292
Aminus <- matrix(Aminus,4,4,byrow=T)
eigen(t(Aminus))
$values
[1] 1.00000000+0.00000000i 0.11543346+0.00000000i 0.03628327+0.07470554i
[4] 0.03628327-0.07470554i
$vectors
[,1] [,2] [,3] [,4]
[1,] -0.7229627+0i -0.7235334+0i 0.4323611+0.44531554i 0.4323611-0.44531554i
[2,] -0.6799330+0i 0.3554384+0i -0.1760636+0.31000760i -0.1760636-0.31000760i
[3,] -0.6594832+0i -0.6487878+0i -0.6370600-0.72207947i -0.6370600+0.72207947i
[4,] -0.6982125+0i 1.0168828+0i 0.3807625-0.03324366i 0.3807625+0.03324366i
v2 <- eigen(t(Aminus))$vectors
> v2[,1]/sum(v2[,1])
[1] 0.2618869+0i 0.2462998+0i 0.2388920+0i 0.2529213+0i
> Aminus%*%Aminus%*%Aminus%*%Aminus%*%Aminus%*%Aminus%*%Aminus
[,1] [,2] [,3] [,4]
[1,] 0.2618870 0.2462997 0.2388921 0.2529212
[2,] 0.2618869 0.2462998 0.2388919 0.2529214
[3,] 0.2618869 0.2462998 0.2388920 0.2529213
[4,] 0.2618868 0.2462998 0.2388920 0.2529214
> round(log2(Aplus/Aminus),3)
[,1] [,2] [,3] [,4]
[1,] -0.737 0.419 0.580 -0.807
[2,] -0.913 0.304 1.813 -0.684
[3,] -0.623 0.463 0.332 -0.735
[4,] -1.164 0.571 0.395 -0.682
> piminus <- Re(round(v2[,1]/sum(v2[,1]),3))
> piminus
[1] 0.262 0.246 0.239 0.253
> piplus <- Re(round(v1[,1]/sum(v1[,1]),3))
> piplus
[1] 0.155 0.341 0.350 0.154
round(log2(piplus/piminus),3)
[1] -0.757 0.471 0.550 -0.716
First-Order Markov Chain for DNA Sequences¶
Consider a sequence of nucleotides, e.g. \(x = \{A,C,G,G,C,C,A,G,T,A,C,C,G,G\}\). Then

\(P(x) = P(x_L \mid x_{L-1},\ldots,x_1)\, P(x_{L-1} \mid x_{L-2},\ldots,x_1) \cdots P(x_1)\)

Assume now that the dependence is of first order, i.e. Markovian:

\(P(x_i \mid x_{i-1},\ldots,x_1) = P(x_i \mid x_{i-1})\), so that \(P(x) = P(x_1) \prod_{i=2}^{L} P(x_i \mid x_{i-1})\);

this is by definition a first-order Markov Chain for DNA.
Note: for a Second-Order Markov Chain for DNA Sequences the dependence could be written \(P(x_i \mid x_{i-1},\ldots,x_1) = P(x_i \mid x_{i-1}, x_{i-2})\).
Then, instead of working with single-position states, i.e. \(x = \{ A,C,G,G,C,C,A,G,T,A,C,C,G,G\}\), we could work with 2-position states, i.e. \(x = \{(A,C),(C,G),(G,G),(G,C),(C,C),(C,A),(A,G),(G,T),(T,A),(A,C),(C,C),(C,G),(G,G)\}\)
The Second-Order Markov Chain over the 4 elements \(\{A,C,G,T\}\) is equivalent to a First-Order Markov Chain over the 16 two-position states \((AA),(AG),(AC),(AT),(GA),(GG),(GC),(GT)\), etc.
Second Question: Given a Long Stretch of DNA, find the CpG Islands in it¶
- First Approach
  - Build the two First-Order Markov chains for the two regions, as before.
  - Take windows of the DNA segment, e.g. 100 nucleotides long.
  - Compute the log-odds for a window and check it against the two Markov models. May need to change the length of the window (a sketch of this windowed scoring appears just after this list).
  - Determine the regions with CpG Islands.
- Second Approach: Integrate the Two Markov Models into One
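A minimal sketch of the windowed scoring used in the first approach, assuming the score function from the sketch above and a long sequence dna given as a character vector (both names are ours):

# Slide a window along a long sequence and score each window.
window_scores <- function(dna, width = 100, step = 10) {
  starts <- seq(1, length(dna) - width + 1, by = step)
  sapply(starts, function(s) score(dna[s:(s + width - 1)]))
}
# Windows with clearly positive scores are candidate CpG islands;
# as noted above, the window length may need to be tuned.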
The Hidden Markov Model used as the model¶
We distinguish the sequence of states from the sequence of symbols: we enrich the original states A, C, G, T with an island label (+ or -), so there are now 8 possible states: A+, C+, G+, T+, A-, C-, G-, T-.
Path \(\pi\) : The state sequence (specified, + or -, state of every nucleotide). \(\pi_i\) is the ith state in the path.
- The states in the path follow a simple Markov Chain
Remark: beware, the island labels (+/-) on their own do not necessarily follow a Markov chain.
- Transition Probabilities: \(a_{kl}\)
- Emissions: the sequence of symbols (nucleotides of unspecified state, + or -):
\(\{... AGGCATCCTA AGTCTGACGCGAGGCTTAC ...\}\)
- States and Symbols are decoupled
- Emission Probability: probability of emitted symbol \(b\): \(e_k(b) = P(x_i = b \mid \pi_i = k)\) (= 0 or 1 for the CpG island problem)
Think of HMMs as generative models that generate or emit sequences.
Example, Casino: generate random sequences of rolls by
* Simulating the successive choices of die (hidden Markov decision)
* Rolling the chosen die (known probability)
More generally:
- Choose an initial state \(\pi_1\) according to probability \(a_{0\pi_1}\)
- Emit observation according to distribution \(e_k (b)=P(x_1=b\mid \pi_1 = k )\) for that state
- Then a new state \(\pi_2\) is chosen according to the transition probability \(a_{\pi_1\pi_2}\) (+ to +, + to -, - to +, - to -)
The above process generates sequences of random observations in which an overt process (the emitted symbols) is combined with a hidden one (the sequence of + / - states). Path \(\pi\):
- Transition Probabilities : \(a_{kl} = P (\pi_i = l \mid \pi_{i-1} = k )\)
- Emissions : \(\{... AGGCATCCTA AGTCTGACGCGAGGCTTAC..\}\)
- Emission Probability: probability of emitted symbol \(b\): \(e_k(b) = P(x_i = b \mid \pi_i = k)\)
- Joint Probability of an observed sequence of symbols \(x\) and a state sequence \(\pi\):

\(P(x,\pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}\) (with the convention that \(\pi_{L+1} = 0\), the end state)
Example: Sequence of Emissions (Symbols): \(CGCG\). State Sequence (Path): \(C_+ G_- C_- G_+\). Joint Probability = \((a_{0,C_+} )*1*(a_{C_+,G_-})*1*(a_{G_-,C_-})*1*(a_{C_-,G_+})*1*(a_{G_+,0})\) (the factors of 1 are the emission probabilities, which are 0 or 1 here).
Example: The Occasionally Dishonest Casino. The Casino uses two dice, a well-balanced (fair) one and an unbalanced (loaded) one, with the following probabilities:
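The probability table itself is not reproduced here; as an illustration of the generative view described above, here is a minimal R simulation of an occasionally dishonest casino. The transition and emission values below are assumptions (the standard textbook values of Durbin et al.), not necessarily those used in the course, and the uniform initial state is also an assumption.

# Simulate an occasionally dishonest casino HMM (parameter values are assumptions).
set.seed(1)
states <- c("F", "L")                        # Fair and Loaded dice
A <- matrix(c(0.95, 0.05,                    # transition probabilities a_{kl}
              0.10, 0.90), 2, 2, byrow = TRUE, dimnames = list(states, states))
E <- rbind(F = rep(1/6, 6),                  # emission probabilities e_k(b)
           L = c(rep(0.1, 5), 0.5))
simulate_hmm <- function(n) {
  path <- character(n); rolls <- integer(n)
  path[1]  <- sample(states, 1)                              # initial state (assumed uniform)
  rolls[1] <- sample(1:6, 1, prob = E[path[1], ])
  for (i in 2:n) {
    path[i]  <- sample(states, 1, prob = A[path[i - 1], ])   # hidden Markov step
    rolls[i] <- sample(1:6, 1, prob = E[path[i], ])          # emit a symbol
  }
  list(path = path, rolls = rolls)
}
simulate_hmm(20)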
Given: a sequence of nucleotides, e.g. \(CGCG\). The sequence of symbols \(\{CGCG\}\) can be generated from any of the following paths (among others): \(\{C_+G_+C_+G_+\}\), \(\{C_-G_-C_-G_-\}\), \(\{C_+G_-C_+G_-\}\), \(\{C_-G_+C_-G_+\}\), \(\{C_-G_+C_+G_-\}\), \(\{C_+G_-C_-G_+\}\), with very different probabilities. Find: the sequence of the underlying states, i.e. the path. Solution: from the set of all possible state sequences which can produce the sequence of the observed symbols, select the one which maximizes the joint probability of the given sequence of symbols \(x\) and associated sequence of states (Path) \(\pi\), i.e. \(\pi^* = \mbox{argmax}_\pi P(x,\pi)\).
If we think of the number of possible paths: if \(L\) is the length of the sequence, there are up to \(8^L\) possible paths, which grows exponentially with \(L\), so we need a new idea.
The Viterbi Algorithm¶
An extension of the dynamic programming idea of backward induction, made possible by the Markovian property. Notation can be a little confusing. The number of possible states will be \(S\) (2 for the unfair casino, 3 for the weather: sunny, cloudy, rainy, 8 for the CpG islands). The transitions between states are given by an \(S\times S\) stochastic matrix. We then have observables: in the unfair casino, the die roll; in the weather example, the states of the seaweed (soggy, dry, damp, wet); there may also be other observables, such as sounds that are heard. For these observables we suppose we have a confusion matrix, which is the matrix of probabilities of seeing a certain observable given a state, e.g. \(P(6 \mid F)=1/6\) or \(P(wet\mid rainy)\). In the specific case of the nucleotides we know for sure that if we see an A, the underlying state can only be A+ or A-; the other probabilities are 0. We saw that we denote the emission probabilities by indexing by the state: \(e_{s}(o)=P(\mbox{see } o\mid \mbox{state is } s)\).
We compute partial probabilities up to a point.
We call \(v_k(i)\) the probability of the most probable path ending in state \(k\) with observation \(i\) equal to \(x_i\), for all possible states \(k\).
Calculation of \(v_Z(i)\): the most probable path to the state \(Z\) will have to pass through A, B, or C at step \(i-1\). The most probable path to Z at step \(i\) will be either:
- a sequence of states ending in A, Z
- a sequence of states ending in B, Z
- a sequence of states ending in C, Z
Viterbi Algorithm
We want the path ending in A,Z, B,Z or C,Z, with an observation of \(x_i\) at step \(i\), that has the highest (complete) probability. The path going through A has probability \(v_A(i-1)\, a_{AZ}\, e_Z(x_i)\).
The other two probabilities are the same with A replaced by B or C. We have to compute these three probabilities and take the maximum one:
Step-0: Initialize (i=0): \(v_0 (0) = 1, v_k(0) = 0\) for \(k>0\)
Steps \(i=1\ldots L\): \(v_l(i)=e_l(x_i)\, \max_k \{v_k(i-1)\, a_{kl} \}\)
* The Viterbi algorithm employs the strategy of Dynamic Programming.
* Probabilities should be expressed in log space to avoid underflow errors.
* We need a backward pointer to the state \(k^*\) that gave the max in \(v_\ell(i)\); the backward pointer doesn't need the emission probabilities, since for the same state \(\ell\) the emission probabilities are the same, so the pointer is \(\mbox{ptr}_i(\ell) = \mbox{argmax}_k \{v_k(i-1)\, a_{k\ell}\}\).
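A minimal generic implementation of this recursion in R, working in log space as recommended; the argument names (A for the \(S\times S\) transition matrix, E for the emission/confusion matrix, init for the initial distribution, obs for the observations coded as column indices of E) are our own conventions, not part of the notes:

# Viterbi decoding in log space.
viterbi <- function(obs, A, E, init) {
  S <- nrow(A); L <- length(obs)
  v   <- matrix(-Inf, S, L)   # v[k, i]: log prob of best path ending in state k at step i
  ptr <- matrix(0L, S, L)     # back pointers
  v[, 1] <- log(init) + log(E[, obs[1]])
  for (i in 2:L) {
    for (l in 1:S) {
      cand <- v[, i - 1] + log(A[, l])   # v_k(i-1) * a_{kl}, in logs
      ptr[l, i] <- which.max(cand)
      v[l, i]   <- log(E[l, obs[i]]) + max(cand)
    }
  }
  path <- integer(L)
  path[L] <- which.max(v[, L])                     # best final state
  for (i in L:2) path[i - 1] <- ptr[path[i], i]    # trace the pointers back
  list(path = path, logprob = max(v[, L]))
}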
Actual Example: Three states of the weather Sun, Cloudy, Rainy with transition matrix
We saw Brittle, Damp, Soggy... What was the weather over those 3 days? Initialization with the first observation, Brittle: \(v_{sunny}(1)=0.378\), \(v_{cloudy}(1)=0.0425\), \(v_{rainy}(1)=0.01\).
Step one, state sunny:
\(P(sunny\mid sunny) \times P(Damp\mid sunny) \times 0.378 = 0.5 \times 0.15 \times 0.378 = 0.02835\)
\(P(sunny\mid cloudy) \times P(Damp\mid sunny) \times 0.0425 = 0.375 \times 0.15 \times 0.0425 = 0.00239\)
\(P(sunny\mid rainy) \times P(Damp\mid sunny) \times 0.01 = 0.125 \times 0.15 \times 0.01 = 0.0001875\)
The maximum, 0.02835, becomes \(v_{sunny}(2)\), with a back pointer to sunny.
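This step can be reproduced directly in R; only the quantities quoted in the text are used, since the full transition and emission matrices of the weather example are not reproduced in these notes:

v1 <- c(sunny = 0.378, cloudy = 0.0425, rainy = 0.01)       # initialization with Brittle
a_to_sunny <- c(sunny = 0.5, cloudy = 0.375, rainy = 0.125) # P(sunny today | state yesterday)
e_sunny_damp <- 0.15                                        # P(Damp | sunny)
cand <- a_to_sunny * e_sunny_damp * v1   # candidate scores for ending in sunny at step 2
round(cand, 5)
#  sunny 0.02835   cloudy 0.00239   rainy 0.00019
v_sunny_2   <- max(cand)                 # 0.02835
ptr_sunny_2 <- names(which.max(cand))    # "sunny"

The computations for the cloudy and rainy states at step two follow the same pattern.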
Step two:
Advantages of Viterbi: computational complexity is reduced; it is not deterred by noise in the middle, since it looks at the whole path before deciding which is best.
Disadvantages: we don't get to see any slightly suboptimal paths. We will remedy this when we look at the Gibbs sampler.
Forward Algorithm¶
We have a sequence \(x\) for which we would like to compute the probability, summed over all possible paths: \(P(x)=\sum_\pi P(x,\pi)\).
In fact we can compute this with a similar algorithm in which the maximum is replaced by a sum. Define \(f_k(i) = P(x_1\ldots x_i, \pi_i = k)\); then \(f_l(i)=e_l(x_i)\sum_k f_k(i-1)\, a_{kl}\).
For a sequence of length \(L\): \(P(x)=\sum_k f_k(L)\).
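A sketch of the forward recursion in R, with the same assumed argument conventions as the viterbi sketch above (no rescaling or log space is used here, so it is only suitable for short sequences):

# Forward algorithm: f[k, i] = P(x_1..x_i, pi_i = k).
forward <- function(obs, A, E, init) {
  S <- nrow(A); L <- length(obs)
  f <- matrix(0, S, L)
  f[, 1] <- init * E[, obs[1]]
  for (i in 2:L)
    f[, i] <- E[, obs[i]] * as.vector(t(A) %*% f[, i - 1])  # e_l(x_i) * sum_k f_k(i-1) a_{kl}
  list(f = f, Px = sum(f[, L]))                             # P(x) = sum_k f_k(L)
}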
Backward Algorithm¶
Suppose we observe a sequence \(x\) and want to evaluate \(P(\pi_i=k\mid x)\). Write \(P(x, \pi_i = k) = P(x_1\ldots x_i, \pi_i = k)\, P(x_{i+1}\ldots x_L \mid \pi_i = k)\). The first term is \(f_k(i)\) from the forward algorithm; the second term is called \(b_k(i)\) and can be computed backwards from the ending of the sequence, in the same way the forward algorithm worked. Then \(P(\pi_i=k\mid x) = f_k(i)\, b_k(i)/P(x)\).
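A matching sketch of the backward recursion and the posterior state probabilities, reusing the forward sketch above (same assumed conventions):

# Backward algorithm: b[k, i] = P(x_{i+1}..x_L | pi_i = k).
backward <- function(obs, A, E) {
  S <- nrow(A); L <- length(obs)
  b <- matrix(0, S, L)
  b[, L] <- 1
  for (i in (L - 1):1)
    b[, i] <- as.vector(A %*% (E[, obs[i + 1]] * b[, i + 1]))  # sum_l a_{kl} e_l(x_{i+1}) b_l(i+1)
  b
}
posterior <- function(obs, A, E, init) {
  fw <- forward(obs, A, E, init)
  b  <- backward(obs, A, E)
  fw$f * b / fw$Px            # P(pi_i = k | x) = f_k(i) b_k(i) / P(x)
}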
Estimation of the Hidden Markov Model¶
Parameters¶
If a state sequence is known, we have an observed matrix of counts \(A_{ij}\) of the number of times state \(j\) followed state \(i\), and a matrix of counts \(E_{sl}\) of the number of times state \(s\) emitted observable \(l\); maximum likelihood theory gives as estimates for \(a\) and \(e\): \(a_{ij} = A_{ij}/\sum_{j'} A_{ij'}\) and \(e_s(l) = E_{sl}/\sum_{l'} E_{sl'}\).
We may have to use pseudocounts (equivalent to a Dirichlet prior) if the numbers are small or zero.
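A small sketch of these estimates with pseudocounts, assuming the count matrices Acount (transitions) and Ecount (emissions) have already been tabulated from a labelled training path (the names and the pseudocount value are ours):

# Maximum likelihood estimates from observed counts, with pseudocounts added
# (a Dirichlet-prior style regularization) to avoid zero probabilities.
estimate_hmm <- function(Acount, Ecount, pseudo = 1) {
  Acount <- Acount + pseudo
  Ecount <- Ecount + pseudo
  list(a = Acount / rowSums(Acount),   # a_{ij} = A_{ij} / sum_j' A_{ij'}
       e = Ecount / rowSums(Ecount))   # e_s(l) = E_{sl} / sum_l' E_{sl'}
}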
Baum-Welch, or EM specialized to HMMs¶
We want to find the parameters (the model) \(\theta\) that maximize the likelihood of the observed data, \(P(x \mid \theta)\).
\(x\) means all the observations, including all the different sequences \(x^j, j=1\ldots n\). We build a sequence of models \(\theta^t\); the missing data are the paths \(\pi\). The BW algorithm calculates \(A_{kl}\) and \(E_k(b)\) as the expected number of times each transition/emission occurs, given the training sequences:

\(P(a_{kl} \mbox{ used at position } i \mbox{ in sequence } x) = P(\pi_i=k, \pi_{i+1}=l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}\)
thus the estimate by expectation (averaging over positions and training sequences) is \(A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f^j_k(i)\, a_{kl}\, e_l(x^j_{i+1})\, b^j_l(i+1)\), and similarly for \(E_k(b)\). The algorithm is:
Initialize at arbitrary parameter values. For each sequence \(j=1\ldots n\):
- Calculate \(f_k(i)\) for sequence \(j\) using the forward algorithm.
- Calculate \(b_k(i)\) for sequence \(j\) using the backward algorithm.
- Add the contribution of sequence \(j\) to \(A\) and \(E\).
Calculate the new model parameters from the new \(A\) and \(E\). Calculate the new log likelihood of the model. Stop when the incremental change is small, or after enough iterations.
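A sketch of the expected-count (E) step at the heart of Baum-Welch for a single training sequence, reusing the forward and backward sketches above; this is a bare-bones illustration without pseudocounts or the convergence loop, and obs is assumed to be integer-coded:

# One Baum-Welch E-step for a single sequence: expected transition counts A_kl
# and expected emission counts E_k(b), given the current parameters.
expected_counts <- function(obs, A, E, init) {
  fw <- forward(obs, A, E, init); b <- backward(obs, A, E)
  S <- nrow(A); L <- length(obs)
  Acount <- matrix(0, S, S); Ecount <- matrix(0, S, ncol(E))
  for (i in 1:(L - 1))       # f_k(i) a_{kl} e_l(x_{i+1}) b_l(i+1) / P(x), summed over positions
    Acount <- Acount + (fw$f[, i] %o% (E[, obs[i + 1]] * b[, i + 1])) * A / fw$Px
  post <- fw$f * b / fw$Px   # P(pi_i = k | x)
  for (i in 1:L) Ecount[, obs[i]] <- Ecount[, obs[i]] + post[, i]
  list(A = Acount, E = Ecount)
}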
Structure of HMMs¶
Prior knowledge should be used as optimally as possible.
Duration¶
* Length (extent) of the model, i.e. how long we stay with the same nucleotide frequencies, for instance:
- Exponentially decaying length (geometric): \(P(k \mbox{ residues}) = (1-p)p^{k-1}\)
- Defined range of lengths, e.g., a model with a distribution of lengths between 2 and 8
- Non-geometric length distribution, e.g., array of \(n\) states
* Silent (or null) states, e.g. Begin and End states, do not emit any symbols. They are used to simplify the models and reduce the number of transition probabilities to estimate from training data.
No loops consisting entirely of silent states are allowed. All HMM-related algorithms can be easily adjusted to incorporate silent states.