本文是對前幾天的打臉
打臉之處就在于發(fā)現(xiàn)了muscle的R包,從此不用再為序列比對東奔西跑——
Step 1. 從基因名拿到序列

sub1 <- c("HDAC1","HDAC2","HDAC3","HDAC8")
sub2 <- c("HDAC4","HDAC5","HDAC6","HDAC7","HDAC9","HDAC10")
sub3 <- paste0("SIRT",1:7)
sub4 <- "HDAC11"
allsymb <- c(sub1,sub2,sub3,sub4)
這一步用到的包:biomaRt ,選擇用symbol獲取蛋白質(zhì)序列。
library(biomaRt)
mart <- useMart("ensembl","hsapiens_gene_ensembl")
all_ppseqs <- getSequence(id = allsymb,
type="hgnc_symbol",
seqType="peptide",
mart = mart)
檢查一下,發(fā)現(xiàn)少了SIRT6.
table(all_ppseqs$hgnc_symbol)
#
# HDAC1 HDAC10 HDAC11 HDAC2 HDAC3 HDAC4 HDAC5 HDAC6 HDAC7 HDAC8 HDAC9 SIRT1 SIRT2
# 3 6 16 12 4 8 8 18 23 45 18 4 14
# SIRT3 SIRT4 SIRT5 SIRT7
# 12 3 7 4
會出現(xiàn)部分基因名匹配不到序列的情況,一般用 ENTREZ ID 可以避免這種情況。于是現(xiàn)在需要補上SIRT6的序列。
library(org.Hs.eg.db)
sirt6entrez <- AnnotationDbi::select(org.Hs.eg.db,"SIRT6","ENTREZID","SYMBOL")
sirt6entrez
# SYMBOL ENTREZID
# 1 SIRT6 51548
sirt6_ppseqs <- getSequence(id = "51548",
type="entrezgene_id",
seqType="peptide",
mart = mart)
再和其他序列合并成一個數(shù)據(jù)框,去除unavailable序列。

colnames(sirt6_ppseqs)[2] <- "hgnc_symbol"
sirt6_ppseqs[,2] <- "SIRT6"
allppseqs <- rbind(all_ppseqs,sirt6_ppseqs)
allppseqs <- allppseqs[allppseqs$peptide!="Sequence unavailable",]
allppseqs <- allppseqs[order(allppseqs$hgnc_symbol),]
給重復(fù)的基因名重命名:
index <- duplicated(allppseqs$hgnc_symbol)
i <- 1
for(j in 1:nrow(allppseqs)){
if(index[j]==FALSE){
i = 1
allppseqs[j,2] <- paste0(allppseqs[j,2], ".",i)
}
else {
i = i+1
allppseqs[j,2] <- paste0(allppseqs[j,2], ".",i)
}
}

輸出為FASTA文件:
exportFASTA(allppseqs,"allppseqs.fasta")
Step 2. 多序列比對及繪制進化樹
用到的包:Biostrings muscle ape
library(Biostrings)
library(muscle)
library(ape)
MUSCLE 算法的特點之一就是快,采用了k-mer的全局/成對比對方法,最后得出計分矩陣。
MUSCLE 在EBI的網(wǎng)頁版工具傳送門?? https://www.ebi.ac.uk/Tools/msa/muscle/
MUSCLE 軟件版?zhèn)魉烷T?? http://www.drive5.com/muscle/downloads.htm
The MUSCLE algorithm is a progressive alignment method that works with DNA, RNA, and amino acidsequences producing high-accuracy alignments with very fast computational times (Edgar, 2004,a). The algorithm is iterative, with later iterations refining the earlier alignments. In each iteration, pairwise alignment scores are computed for all sequence pairs (based on k-mer counting or global pairwise alignments) and the values are entered into a triangular distance matrix.
讀入FASTA文件:
myseq <- readAAStringSet("allppseqs.fasta", format="fasta",
nrec=-1L, skip=0L, seek.first.rec=FALSE, use.names=TRUE)

muscle多序列比對:
aln <- muscle::muscle(myseq)
這一步一般比較費時。

當然 muscle() 有很多參數(shù)可以設(shè)置,提速或設(shè)定空位罰分等等:
-
diags = TRUE速度提高自然意味著準確度下降Enhanced speed. To enhance the speed of the algorithm, the diags = TRUE flag will optimize the speed with a potential loss of accuracy.
-
gapopen = -30空位罰分Gap penalties. Default gap penalties can be modified to produce altered alignments. The gap penalty must be negative, with larger negative values indicating more stringent penalties.
-
maxhours = 24.0比對運行的最大時限Maximum number of hours. If an alignment is expected to take a long time, a maximum total numberof hours can be specified, which, if reached, will lead to the algorithm stopping at this point and returningthe current alignment.
aln
detail(aln)


修剪序列:
auto <- maskGaps(aln, min.fraction=0.5, min.block.width=4)
auto
除此之外還可以用 rowmask() colmask() 手動設(shè)置masking區(qū)域。


畫樹!
sdist <- stringDist(as(aotu,"AAStringSet"), method="hamming")
clust <- hclust(sdist, method = "single")
plot(as.phylo(clust),type="fan",cex = 0.8)

References
- A guide to using muscle https://bioconductor.org/packages/release/bioc/vignettes/muscle/inst/doc/muscle-vignette.pdf
- plot單獨畫出pheatmap返回的聚類結(jié)果(聚類樹)http://www.omicsclass.com/article/510
最后,向大家隆重推薦生信技能樹的一系列干貨!
- 生信技能樹全球公益巡講:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
- B站公益74小時生信工程師教學視頻合輯:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
- 招學徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
