本文是對前幾天的打臉
打臉之處就在于發(fā)現(xiàn)了muscle的R包，從此不用再為序列比對東奔西跑——

Step 1. 從基因名拿到序列

sub1 <- c("HDAC1","HDAC2","HDAC3","HDAC8")
sub2 <- c("HDAC4","HDAC5","HDAC6","HDAC7","HDAC9","HDAC10")
sub3 <- paste0("SIRT",1:7)
sub4 <- "HDAC11"
allsymb <- c(sub1,sub2,sub3,sub4)

這一步用到的包：biomaRt ，選擇用symbol獲取蛋白質(zhì)序列。

library(biomaRt)
mart <- useMart("ensembl","hsapiens_gene_ensembl")
all_ppseqs <- getSequence(id = allsymb,
                          type="hgnc_symbol",
                          seqType="peptide",
                          mart = mart)

檢查一下，發(fā)現(xiàn)少了SIRT6.

table(all_ppseqs$hgnc_symbol)
# 
#  HDAC1 HDAC10 HDAC11  HDAC2  HDAC3  HDAC4  HDAC5  HDAC6  HDAC7  HDAC8  HDAC9  SIRT1  SIRT2 
#      3      6     16     12      4      8      8     18     23     45     18      4     14 
#  SIRT3  SIRT4  SIRT5  SIRT7 
#     12      3      7      4

會出現(xiàn)部分基因名匹配不到序列的情況，一般用 ENTREZ ID 可以避免這種情況。于是現(xiàn)在需要補上SIRT6的序列。

library(org.Hs.eg.db)
sirt6entrez <- AnnotationDbi::select(org.Hs.eg.db,"SIRT6","ENTREZID","SYMBOL")
sirt6entrez
#   SYMBOL ENTREZID
# 1  SIRT6    51548
sirt6_ppseqs <- getSequence(id = "51548",
                            type="entrezgene_id",
                            seqType="peptide",
                            mart = mart)

再和其他序列合并成一個數(shù)據(jù)框，去除unavailable序列。

colnames(sirt6_ppseqs)[2] <- "hgnc_symbol" 
sirt6_ppseqs[,2] <- "SIRT6"
allppseqs <- rbind(all_ppseqs,sirt6_ppseqs)
allppseqs <- allppseqs[allppseqs$peptide!="Sequence unavailable",]
allppseqs <- allppseqs[order(allppseqs$hgnc_symbol),]

給重復(fù)的基因名重命名：

index <- duplicated(allppseqs$hgnc_symbol)
i <- 1
for(j in 1:nrow(allppseqs)){
if(index[j]==FALSE){
  i = 1
  allppseqs[j,2] <- paste0(allppseqs[j,2], ".",i)
  }
  else {
    i = i+1
    allppseqs[j,2] <- paste0(allppseqs[j,2], ".",i)
  } 
}

輸出為FASTA文件：

exportFASTA(allppseqs,"allppseqs.fasta")

Step 2. 多序列比對及繪制進化樹

用到的包：Biostrings muscle ape

library(Biostrings)
library(muscle)
library(ape)

MUSCLE 算法的特點之一就是快，采用了k-mer的全局/成對比對方法，最后得出計分矩陣。
MUSCLE 在EBI的網(wǎng)頁版工具傳送門?? https://www.ebi.ac.uk/Tools/msa/muscle/
MUSCLE 軟件版?zhèn)魉烷T?? http://www.drive5.com/muscle/downloads.htm

The MUSCLE algorithm is a progressive alignment method that works with DNA, RNA, and amino acidsequences producing high-accuracy alignments with very fast computational times (Edgar, 2004,a). The algorithm is iterative, with later iterations refining the earlier alignments. In each iteration, pairwise alignment scores are computed for all sequence pairs (based on k-mer counting or global pairwise alignments) and the values are entered into a triangular distance matrix.

讀入FASTA文件：

myseq <- readAAStringSet("allppseqs.fasta", format="fasta",
                          nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE)

muscle多序列比對：

aln <- muscle::muscle(myseq)

這一步一般比較費時。

當然 muscle() 有很多參數(shù)可以設(shè)置，提速或設(shè)定空位罰分等等：

diags = TRUE 速度提高自然意味著準確度下降

Enhanced speed. To enhance the speed of the algorithm, the diags = TRUE flag will optimize the speed with a potential loss of accuracy.
gapopen = -30 空位罰分

Gap penalties. Default gap penalties can be modified to produce altered alignments. The gap penalty must be negative, with larger negative values indicating more stringent penalties.
maxhours = 24.0 比對運行的最大時限

Maximum number of hours. If an alignment is expected to take a long time, a maximum total numberof hours can be specified, which, if reached, will lead to the algorithm stopping at this point and returningthe current alignment.

aln
detail(aln)

修剪序列：

auto <- maskGaps(aln, min.fraction=0.5, min.block.width=4)
auto

除此之外還可以用 rowmask() colmask() 手動設(shè)置masking區(qū)域。

畫樹！

sdist <- stringDist(as(aotu,"AAStringSet"), method="hamming") 
clust <- hclust(sdist, method = "single")
plot(as.phylo(clust),type="fan",cex = 0.8)

References

A guide to using muscle https://bioconductor.org/packages/release/bioc/vignettes/muscle/inst/doc/muscle-vignette.pdf
plot單獨畫出pheatmap返回的聚類結(jié)果（聚類樹）http://www.omicsclass.com/article/510

最后，向大家隆重推薦生信技能樹的一系列干貨！

生信技能樹全球公益巡講：https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
B站公益74小時生信工程師教學視頻合輯：https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
招學徒：https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

不出Rstudio, 實現(xiàn)從多序列比對到畫進化樹

不出Rstudio, 實現(xiàn)從多序列比對到畫進化樹

Step 1. 從基因名拿到序列

Step 2. 多序列比對及繪制進化樹

References

最后，向大家隆重推薦生信技能樹的一系列干貨！

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

不出Rstudio, 實現(xiàn)從多序列比對到畫進化樹

Step 1. 從基因名拿到序列

Step 2. 多序列比對及繪制進化樹

References

最后，向大家隆重推薦生信技能樹的一系列干貨！

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

最后，向大家隆重推薦生信技能樹的一系列干貨！