不出Rstudio, 實現(xiàn)從多序列比對到畫進化樹

本文是對前幾天的打臉
打臉之處就在于發(fā)現(xiàn)了muscle的R包,從此不用再為序列比對東奔西跑——

Step 1. 從基因名拿到序列

sub1 <- c("HDAC1","HDAC2","HDAC3","HDAC8")
sub2 <- c("HDAC4","HDAC5","HDAC6","HDAC7","HDAC9","HDAC10")
sub3 <- paste0("SIRT",1:7)
sub4 <- "HDAC11"
allsymb <- c(sub1,sub2,sub3,sub4)

這一步用到的包:biomaRt ,選擇用symbol獲取蛋白質(zhì)序列。

library(biomaRt)
mart <- useMart("ensembl","hsapiens_gene_ensembl")
all_ppseqs <- getSequence(id = allsymb,
                          type="hgnc_symbol",
                          seqType="peptide",
                          mart = mart)

檢查一下,發(fā)現(xiàn)少了SIRT6.

table(all_ppseqs$hgnc_symbol)
# 
#  HDAC1 HDAC10 HDAC11  HDAC2  HDAC3  HDAC4  HDAC5  HDAC6  HDAC7  HDAC8  HDAC9  SIRT1  SIRT2 
#      3      6     16     12      4      8      8     18     23     45     18      4     14 
#  SIRT3  SIRT4  SIRT5  SIRT7 
#     12      3      7      4 

會出現(xiàn)部分基因名匹配不到序列的情況,一般用 ENTREZ ID 可以避免這種情況。于是現(xiàn)在需要補上SIRT6的序列。

library(org.Hs.eg.db)
sirt6entrez <- AnnotationDbi::select(org.Hs.eg.db,"SIRT6","ENTREZID","SYMBOL")
sirt6entrez
#   SYMBOL ENTREZID
# 1  SIRT6    51548
sirt6_ppseqs <- getSequence(id = "51548",
                            type="entrezgene_id",
                            seqType="peptide",
                            mart = mart)

再和其他序列合并成一個數(shù)據(jù)框,去除unavailable序列。

colnames(sirt6_ppseqs)[2] <- "hgnc_symbol" 
sirt6_ppseqs[,2] <- "SIRT6"
allppseqs <- rbind(all_ppseqs,sirt6_ppseqs)
allppseqs <- allppseqs[allppseqs$peptide!="Sequence unavailable",]
allppseqs <- allppseqs[order(allppseqs$hgnc_symbol),]

給重復(fù)的基因名重命名:

index <- duplicated(allppseqs$hgnc_symbol)
i <- 1
for(j in 1:nrow(allppseqs)){
if(index[j]==FALSE){
  i = 1
  allppseqs[j,2] <- paste0(allppseqs[j,2], ".",i)
  }
  else {
    i = i+1
    allppseqs[j,2] <- paste0(allppseqs[j,2], ".",i)
  } 
}

輸出為FASTA文件:

exportFASTA(allppseqs,"allppseqs.fasta")

Step 2. 多序列比對及繪制進化樹

用到的包:Biostrings muscle ape

library(Biostrings)
library(muscle)
library(ape)

MUSCLE 算法的特點之一就是快,采用了k-mer的全局/成對比對方法,最后得出計分矩陣。
MUSCLE 在EBI的網(wǎng)頁版工具傳送門?? https://www.ebi.ac.uk/Tools/msa/muscle/
MUSCLE 軟件版?zhèn)魉烷T?? http://www.drive5.com/muscle/downloads.htm

The MUSCLE algorithm is a progressive alignment method that works with DNA, RNA, and amino acidsequences producing high-accuracy alignments with very fast computational times (Edgar, 2004,a). The algorithm is iterative, with later iterations refining the earlier alignments. In each iteration, pairwise alignment scores are computed for all sequence pairs (based on k-mer counting or global pairwise alignments) and the values are entered into a triangular distance matrix.

讀入FASTA文件:

myseq <- readAAStringSet("allppseqs.fasta", format="fasta",
                          nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE)

muscle多序列比對:

aln <- muscle::muscle(myseq)

這一步一般比較費時。

當然 muscle() 有很多參數(shù)可以設(shè)置,提速或設(shè)定空位罰分等等:

  • diags = TRUE 速度提高自然意味著準確度下降

    Enhanced speed. To enhance the speed of the algorithm, the diags = TRUE flag will optimize the speed with a potential loss of accuracy.

  • gapopen = -30 空位罰分

    Gap penalties. Default gap penalties can be modified to produce altered alignments. The gap penalty must be negative, with larger negative values indicating more stringent penalties.

  • maxhours = 24.0 比對運行的最大時限

    Maximum number of hours. If an alignment is expected to take a long time, a maximum total numberof hours can be specified, which, if reached, will lead to the algorithm stopping at this point and returningthe current alignment.

aln
detail(aln) 

修剪序列:

auto <- maskGaps(aln, min.fraction=0.5, min.block.width=4)
auto

除此之外還可以用 rowmask() colmask() 手動設(shè)置masking區(qū)域。

畫樹!

sdist <- stringDist(as(aotu,"AAStringSet"), method="hamming") 
clust <- hclust(sdist, method = "single")
plot(as.phylo(clust),type="fan",cex = 0.8) 

References


最后,向大家隆重推薦生信技能樹的一系列干貨!

  1. 生信技能樹全球公益巡講:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小時生信工程師教學視頻合輯:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招學徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容