HMM 隱馬爾可夫模型初學(xué)(二)

1、HMM,Hidden Markov model 隱馬爾科夫模型

(1)天氣舉例

  • 假設(shè)不能直接觀察天氣陰晴雨情況,只能看到地面的潮濕情況(假如分為非常潮濕,一般潮濕,不潮濕三種對(duì)應(yīng)A,B,C三種評(píng)級(jí))?,F(xiàn)在我一連觀察了一周的地面潮濕情況(AABBCBA),是否能夠判斷這一周的天氣?
  • 如上所述,有兩類狀態(tài):一類是地面潮濕狀態(tài) observation stata(A、B、C);一類是天氣情況 latent stata(陰晴雨);
  • 之前已經(jīng)知道了天氣的狀態(tài)轉(zhuǎn)移概率矩陣,現(xiàn)在主要就需要知道天氣情況與地面潮濕情況的對(duì)應(yīng)關(guān)系,從而預(yù)測(cè)天氣情況。

(2)Hidden Markov model

HMM理解
  • HMM隨機(jī)生成的狀態(tài)隨機(jī)序列被稱為狀態(tài)序列;每個(gè)狀態(tài)生成一個(gè)觀測(cè),由此產(chǎn)生的觀測(cè)隨機(jī)序列,被稱為觀測(cè)序列。
  • HMM是一個(gè)雙重隨機(jī)過(guò)程---具有一定狀態(tài)的隱馬爾可夫鏈(天氣情況)和隨機(jī)的觀測(cè)序列(地面潮濕情況)。
  • The HMM is a discrete time model: for each point in time t, we have one hidden state that generates one observed event for that time point t.


    HMM
HMM應(yīng)用條件
  • The hidden states follow a Markov process, i.e., the states over time are not independent of one another, but the current state depends on the previous state only (and not on earlier states)
    即MM馬爾科夫的基本假設(shè),當(dāng)前狀態(tài)只與上一狀態(tài)有關(guān)


    assumption1
  • The distribution generating the observation depends on the state of an underlying, hidden state,。
    某時(shí)刻的觀測(cè)結(jié)果只與當(dāng)前時(shí)刻對(duì)應(yīng)的隱狀態(tài)有關(guān)


    assumption2
發(fā)射矩陣 emission distribution matrix
  • 即每種隱狀態(tài)對(duì)應(yīng)的,可能出現(xiàn)的觀測(cè)結(jié)果的概率情況。
  • 例如晴天天氣里,地面出現(xiàn)A、B、C三種情況的概率分別為0.1,0.3,0.6;陰天分別為0.2,0.5,0.3;雨天分別為0.5,0.3,0.2


    EM
HMM三要素
  • 綜上,以及MM知識(shí)點(diǎn),描述HMM,需要三大參數(shù)


    three sets of parameters
  • 此外還有隱狀態(tài)集合,預(yù)測(cè)值集合
    關(guān)于預(yù)測(cè)值,一般以分類類型結(jié)果為主;
    關(guān)于隱狀態(tài),我認(rèn)為在實(shí)際探索的過(guò)程中往往并不清楚狀態(tài)的含義,而是根據(jù)預(yù)測(cè)值及專業(yè)知識(shí)分析狀態(tài)意義。例如我把天氣視為觀測(cè)結(jié)果,那么隱狀態(tài)可以是梅雨期?臺(tái)風(fēng)期?甚至可以說(shuō)是太陽(yáng)公公心情的好壞狀態(tài)。
HMM模型可以解決的三類問(wèn)題?。?!劃重點(diǎn)

(1)概率計(jì)算問(wèn)題

  • 給定模型λ=(A,B,π)和觀測(cè)序列Q={q1,q2,...,qT},計(jì)算模型λ下觀測(cè)到序列Q出現(xiàn)的概率P(Q|λ);
  • 例如給定天氣與地面潮濕情況的HMM三要素,及一組地面潮濕情況數(shù)據(jù)。計(jì)算這組數(shù)據(jù)的出現(xiàn)概率是多大
  • 方法:前向-后向算法

(2)學(xué)習(xí)問(wèn)題

  • 已知觀測(cè)序列Q={q1,q2,...,qT},估計(jì)模型λ=(A,B,π)的參數(shù),使得在該模型下觀測(cè)序列P(Q|λ)最大;
  • 例如已知一組地面潮濕情況數(shù)據(jù),估計(jì)天氣與地面潮濕情況的HMM三要素,從而使出現(xiàn)這組數(shù)據(jù)的可能性最大;
  • 方法:Maximum likelihood, Expectation Maximization or Baum-Welch algorithm, and Bayesian estimation

(3)預(yù)測(cè)問(wèn)題

  • 給定模型λ=(A,B,π)和觀測(cè)序列Q={q1,q2,...,qT},求給定觀測(cè)序列條件概率P(I|Q,λ)最大的狀態(tài)序列I。
  • 例如已知一周的地面潮濕情況數(shù)據(jù),并且已知天氣與地面潮濕情況的HMM三要素,從而估計(jì)這一周最有可能的天氣情況
  • 方法:Viterbi algorithm

個(gè)人認(rèn)為在實(shí)驗(yàn)探索中,觀察預(yù)測(cè)值比較容易獲得,由此學(xué)習(xí)建模,估計(jì)參數(shù)。然后根據(jù)模型結(jié)果,進(jìn)行其它觀察結(jié)果的隱狀態(tài)序列的預(yù)測(cè)。

2、R代碼實(shí)操

  • 已知某特殊DNA序列存在兩個(gè)區(qū)(hidden stata),分別是AT-rich與CG-rich。

  • 前者富含AT堿基,后者CG堿基,具體堿基概率分布(emission matrix)如下圖。


    HMM
  • 此外也知道了初始狀態(tài)分布概率以及狀態(tài)概率轉(zhuǎn)移矩陣,據(jù)此隨機(jī)生成一段符合要求的堿基序列。

  • 第一步:準(zhǔn)備TM、EM

states              <- c("AT-rich", "GC-rich") # Define the names of the states
ATrichprobs         <- c(0.7, 0.3)             # Set the probabilities of switching states, where the previous state was "AT-rich"
GCrichprobs         <- c(0.1, 0.9)             # Set the probabilities of switching states, where the previous state was "GC-rich"
thetransitionmatrix <- matrix(c(ATrichprobs, GCrichprobs), 2, 2, byrow = TRUE) # Create a 2 x 2 matrix
rownames(thetransitionmatrix) <- states
colnames(thetransitionmatrix) <- states
thetransitionmatrix     

nucleotides         <- c("A", "C", "G", "T")   # Define the alphabet of nucleotides
ATrichstateprobs    <- c(0.39, 0.1, 0.1, 0.41) # Set the values of the probabilities, for the AT-rich state
GCrichstateprobs    <- c(0.1, 0.41, 0.39, 0.1) # Set the values of the probabilities, for the GC-rich state
theemissionmatrix <- matrix(c(ATrichstateprobs, GCrichstateprobs), 2, 4, byrow = TRUE) # Create a 2 x 4 matrix
rownames(theemissionmatrix) <- states
colnames(theemissionmatrix) <- nucleotides
theemissionmatrix   
  • 第二步:編寫函數(shù)
# Function to generate a DNA sequence, given a HMM and the length of the sequence to be generated.
generatehmmseq <- function(transitionmatrix, emissionmatrix, initialprobs, seqlength)
{
  nucleotides     <- c("A", "C", "G", "T")   # Define the alphabet of nucleotides
  states          <- c("AT-rich", "GC-rich") # Define the names of the states
  mysequence      <- character()             # Create a vector for storing the new sequence
  mystates        <- character()             # Create a vector for storing the state that each position in the new sequence
  # was generated by
  # Choose the state for the first position in the sequence:
  firststate      <- sample(states, 1, rep=TRUE, prob=initialprobs)
  # Get the probabilities of the current nucleotide, given that we are in the state "firststate":
  probabilities   <- emissionmatrix[firststate,]
  # Choose the nucleotide for the first position in the sequence:
  firstnucleotide <- sample(nucleotides, 1, rep=TRUE, prob=probabilities)
  mysequence[1]   <- firstnucleotide         # Store the nucleotide for the first position of the sequence
  mystates[1]     <- firststate              # Store the state that the first position in the sequence was generated by
  
  for (i in 2:seqlength)
  {
    prevstate    <- mystates[i-1]           # Get the state that the previous nucleotide in the sequence was generated by
    # Get the probabilities of the current state, given that the previous nucleotide was generated by state "prevstate"
    stateprobs   <- transitionmatrix[prevstate,]
    # Choose the state for the ith position in the sequence:
    state        <- sample(states, 1, rep=TRUE, prob=stateprobs)
    # Get the probabilities of the current nucleotide, given that we are in the state "state":
    probabilities <- emissionmatrix[state,]
    # Choose the nucleotide for the ith position in the sequence:
    nucleotide   <- sample(nucleotides, 1, rep=TRUE, prob=probabilities)
    mysequence[i] <- nucleotide             # Store the nucleotide for the current position of the sequence
    mystates[i]  <- state                   # Store the state that the current position in the sequence was generated by
  }
  
  for (i in 1:length(mysequence))
  {
    nucleotide   <- mysequence[i]
    state        <- mystates[i]
    print(paste("Position", i, ", State", state, ", Nucleotide = ", nucleotide))
  }
}
  • 第三步:給定初始狀態(tài)概率進(jìn)行預(yù)測(cè)
theinitialprobs <- c(0.5, 0.5)
generatehmmseq(thetransitionmatrix, theemissionmatrix, theinitialprobs, 30)
result

R實(shí)操代碼本身意義可能不大,但對(duì)于我們具體了解HMM模型很有幫助。在具體應(yīng)用HMM模型時(shí),更多的是采用相應(yīng)的R包進(jìn)行分析。這類R包有不少,會(huì)挑選幾個(gè)進(jìn)行示例學(xué)習(xí)。


參考文章
1、Hidden Markov Models — Bioinformatics 0.1 documentation
2、01 隱馬爾可夫模型 - 馬爾可夫鏈、HMM參數(shù)和性質(zhì) - 簡(jiǎn)書
3、Multilevel HMM tutorial
4、馬爾可夫鏈_百度百科

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容