筆記首先總結(jié)概括HMM相關(guān)理論知識,再通過生信序列的R語言實操對HMM深入了解
1、MM,Markov model 馬爾科夫模型
(1)天氣舉例
- 假設(shè)只有晴、陰、雨三種天氣情況。某地周一為晴天,已知若晴天后,第二天出現(xiàn)晴陰雨的概率分別為0.5,0.3,0.2(sum=1)。
- 例如周二為雨天的概率即為0.2。而已知若雨天后,第二天出現(xiàn)晴陰雨的概率分別為0.1,0.6,0.3(sum=1),可預(yù)測周三為陰天的概率為0.2*0.3=0.06。
- 繼續(xù)預(yù)測,已知若陰天后,第二天出現(xiàn)晴陰雨的概率分別為0.3,0.4,0.3(sum=1),可預(yù)測周四出現(xiàn)晴天的概率為0.20.30.4=0.024;
- 綜上在已知假設(shè)及周一天氣情況下,周一~周四天氣為晴陰雨晴的概率為0.024,還可以繼續(xù)預(yù)測接下來一周的天氣情況概率。
(2)Markov chain馬爾可夫鏈
- 如上假設(shè),即為一個簡單的馬爾科夫鏈。首先來說,它是一個離散性質(zhì)的時序模型,即在順序離散過程中,一個時間點(周幾)代表一種特定的狀態(tài)(天氣)。
-
其次馬爾科夫鏈最重要的假設(shè)是:當(dāng)前的狀態(tài)只和上一時刻有關(guān),與上一時刻之前的任何狀態(tài)都沒有關(guān)系。反映到例子中就是今天的天氣情況只取決于昨天的天氣情況,與前天,上周的天氣都無關(guān)。
反映到如下公式中,t+1時間點的狀態(tài)可能分布只與t時間點的狀態(tài)有關(guān)。
MM
(3)Transition Matrix狀態(tài)轉(zhuǎn)移概率矩陣?。。▌澲攸c)
- stata狀態(tài):馬爾可夫鏈在狀態(tài)空間內(nèi)的取值稱為狀態(tài),所有的取值情況稱為狀態(tài)空間 stata space。簡單理解就是例子中的三種天氣(晴陰雨);
-
狀態(tài)轉(zhuǎn)移概率矩陣即是指由一種特定狀態(tài)分別轉(zhuǎn)移到所有狀態(tài)的概率transition probability(和必然為1),假設(shè)存在n種狀態(tài),則Transition Matrix就是n*n的矩陣。γij的值就表示狀態(tài) i 轉(zhuǎn)移到狀態(tài) j 的概率。
TM
TM -
Transition Matrix的圖形展示
TM,from web
(4)initial probabilities of the states
- 上述例子中是給定周一天氣為晴天,更多情況下馬爾科夫鏈的開端也是狀態(tài)的概率分布π;
-
Often, the initial probabilities of the states πi are assumed to be the stationary distribution implied by the transition probability matrix。
即趨近于無窮時刻下,三種狀態(tài)的分布的概率。
initial probabilities of the states
2、R代碼實操:馬爾科夫模型預(yù)測DNA序列
(1)已知條件
A:特殊的序列
For some DNA sequences, the probability of finding a particular nucleotide at a particular position in the sequence does depend on what nucleotides are found at adjacent positions in the sequence.
即符合馬爾可夫鏈的假設(shè)
B:轉(zhuǎn)移矩陣

C:初始概率
可任意指定,例如T 0.3 C 0.3 G 0.2 A 0.2
(2)R實操
目的:主要根據(jù)轉(zhuǎn)移矩陣預(yù)測可能的這樣一段特殊序列
- 制作TM
nucleotides <- c("A", "C", "G", "T")
afterAprobs <- c(0.2, 0.3, 0.3, 0.2) # Set the values of the probabilities, where the previous nucleotide was "A"
afterCprobs <- c(0.1, 0.41, 0.39, 0.1) # Set the values of the probabilities, where the previous nucleotide was "C"
afterGprobs <- c(0.25, 0.25, 0.25, 0.25) # Set the values of the probabilities, where the previous nucleotide was "G"
afterTprobs <- c(0.5, 0.17, 0.17, 0.17) # Set the values of the probabilities, where the previous nucleotide was "T"
mytransitionmatrix <- matrix(c(afterAprobs, afterCprobs, afterGprobs, afterTprobs), 4, 4, byrow = TRUE) # Create a 4 x 4 matrix
rownames(mytransitionmatrix) <- nucleotides
colnames(mytransitionmatrix) <- nucleotides
mytransitionmatrix

- 編寫函數(shù)
主要基于第一個起始堿基的概率與TM概率分布,利用sample()抽樣函數(shù)進行實現(xiàn)序列預(yù)測
generatemarkovseq <- function(transitionmatrix, initialprobs, seqlength)
{
nucleotides <- c("A", "C", "G", "T") # Define the alphabet of nucleotides
mysequence <- character() # Create a vector for storing the new sequence
# Choose the nucleotide for the first position in the sequence:
firstnucleotide <- sample(nucleotides, 1, rep=TRUE, prob=initialprobs)
mysequence[1] <- firstnucleotide # Store the nucleotide for the first position of the sequence
for (i in 2:seqlength)
{
prevnucleotide <- mysequence[i-1] # Get the previous nucleotide in the new sequence
# Get the probabilities of the current nucleotide, given previous nucleotide "prevnucleotide":
probabilities <- transitionmatrix[prevnucleotide,]
# Choose the nucleotide at the current position of the sequence:
nucleotide <- sample(nucleotides, 1, rep=TRUE, prob=probabilities)
mysequence[i] <- nucleotide # Store the nucleotide for the current position of the sequence
}
return(mysequence)
}
- 運行函數(shù)
myinitialprobs <- c(0.3, 0.3, 0.2, 0.2) #對應(yīng)函數(shù)體設(shè)置的"A", "C", "G", "T"順序
generatemarkovseq(mytransitionmatrix, myinitialprobs, 30)
table(generatemarkovseq(mytransitionmatrix, myinitialprobs, 30))
result

參考文章
1、Hidden Markov Models — Bioinformatics 0.1 documentation
2、01 隱馬爾可夫模型 - 馬爾可夫鏈、HMM參數(shù)和性質(zhì) - 簡書
3、Multilevel HMM tutorial
4、馬爾可夫鏈_百度百科




