ce n'est que le premier pas qui co?te.
seqinr
生物序列獲取和分析
Charif D, Lobry J, Necsulea A, et al. The seqinr Package[J]. R Packag, 2007.
1. 關(guān)于 seqinr
字面意思上,seqinr 只是 "Sequences in R" 的縮寫,也可以理解為 “從GenBank等數(shù)據(jù)庫獲取 (Retrieve) 序列”,很好地解釋了 seqinr 的主要功能:在 R 中,提供一種高效的訪問序列數(shù)據(jù)庫的方法。
2. 以人類CDS序列為例實(shí)戰(zhàn)一波
2.1 從 Ensembl 下載 FASTA 文件

2.2 讀入文件
library(seqinr)
human_cds <- read.fasta("Homo_sapiens.GRCh38.cds.all.fa")

每個(gè)轉(zhuǎn)錄本都是這個(gè) list 的一個(gè)元素。
human_cds[["ENST00000390572.1"]]
# [1] "a" "g" "c" "a" "t" "a" "t" "t" "g" "t" "g" "g" "t" "g" "g" "t" "g" "a" "c" "t"
# [21] "g" "c" "t" "a" "t" "t" "c" "c"
# attr(,"name")
# [1] "ENST00000390572.1"
# attr(,"Annot")
# [1] ">ENST00000390572.1 cds chromosome:GRCh38:14:105888551:105888578:-1 gene:ENSG00000211912.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD2-21 description:immunoglobulin heavy diversity 2-21 [Source:HGNC Symbol;Acc:HGNC:5491]"
# attr(,"class")
# [1] "SeqFastadna"
2.3 幾個(gè)統(tǒng)計(jì)序列概況的函數(shù)
-
count()統(tǒng)計(jì)序列中寡聚物 (oligomers) 的數(shù)量,參數(shù)
wordsize決定 "n-mer". -
GC()統(tǒng)計(jì)核苷酸序列中的GC含量。
length()
cds66 <- human_cds[[66]]
cds66
# [1] "t" "g" "a" "c" "t" "a" "c" "a" "g" "t" "a" "a" "c" "t" "a"
# [16] "c"
# attr(,"name")
# [1] "ENST00000634085.1"
# attr(,"Annot")
# [1] ">ENST00000634085.1 cds chromosome:GRCh38:CHR_HSCHR14_3_CTG1:105913993:105914008:-1 gene:ENSG00000282227.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD4-4 description:immunoglobulin heavy diversity 4-4 [Source:HGNC Symbol;Acc:HGNC:5505]"
# attr(,"class")
# [1] "SeqFastadna"
count(cds66,1)
#
# a c g t
# 6 4 2 4
count(cds66,2)
#
# aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
# 1 4 1 0 1 0 0 2 1 0 0 1 3 0 1 0
count(cds66,3)
#
# aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att
# 0 1 0 0 1 0 0 2 0 0 0 1 0 0 0 0
# caa cac cag cat cca ccc ccg cct cga cgc cgg cgt cta ctc ctg ctt
# 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0
# gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc gtg gtt
# 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
# taa tac tag tat tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt
# 1 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0
GC(cds66)
# [1] 0.375
length(cds66)
# [1] 16
2.4 提取注釋信息
annotation <- getAnnot(human_cds)
annotation[[66]]
# [1] ">ENST00000634085.1 cds chromosome:GRCh38:CHR_HSCHR14_3_CTG1:105913993:105914008:-1 gene:ENSG00000282227.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:IGHD4-4 description:immunoglobulin heavy diversity 4-4 [Source:HGNC Symbol;Acc:HGNC:5505]"
發(fā)現(xiàn)只有 "description" 里的信息比較有辨識(shí)度,于是——
cancer_index <- grep("cancer",annotation,ignore.case = T)
length(cancer_index)
# [1] 95
cancer_index
# [1] 1891 1892 5120 5121 5122 16950 16951 16952 16953 16954
# [11] 16955 16956 16957 23827 23828 23829 23830 23831 29035 29036
# [21] 37044 37132 37207 37219 37239 37259 37405 37484 37503 37537
# [31] 47785 56845 57929 59658 59659 59714 59715 60447 60448 68942
# [41] 78490 78491 78492 78493 78494 78495 78496 78497 78498 84049
# [51] 84050 84051 84052 86921 86922 86923 86924 86925 86926 86927
# [61] 88828 88834 88882 88883 92802 92803 97079 97080 97081 97082
# [71] 97083 97084 97085 97086 100900 101862 101863 102719 102720 102723
# [81] 102724 102879 102880 102890 102891 102901 102902 102973 102974 102976
# [91] 102977 102978 102979 102988 102989
驗(yàn)證一下,"description" 里確實(shí)是有 "cancer" 的。
annotation[[102989]]
# [1] ">ENST00000594117.4 cds chromosome:GRCh38:X:135718930:135723539:1 gene:ENSG00000268940.5 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CT45A1 description:cancer/testis antigen family 45 member A1 [Source:HGNC Symbol;Acc:HGNC:33267]"
2.5 將篩選結(jié)果導(dǎo)出為 FASTA 文件
cancer_transcript <- human_cds[cancer_index]
length(cancer_transcript)
# [1] 95 ## 和index的長度是一樣的
cancer_annotation <- getAnnot(cancer_transcript)
cancer_sequence <- getSequence(cancer_transcript)
write.fasta(cancer_sequence,cancer_annotation,"human_cancertscpt.fa")
再導(dǎo)入文件查看一下:
human_cancertscpt <- read.fasta("human_cancertscpt.fa")

3. 一些可視化操作
3.1 序列長度分布圖
以上面得到的 cancer_transcript 為例。
getLength(cancer_transcript)
# [1] 795 729 2142 468 1245 2103 2454 2520 1926 729 2334 2475 1674 1821 1473
# [16] 1890 267 99 213 210 867 867 867 867 867 867 867 867 867 867
# [31] 411 411 342 543 633 543 507 633 543 975 2169 1647 2073 2151 2031
# [46] 108 1974 138 278 2142 1707 1578 1230 756 402 861 320 465 465 465
# [61] 900 126 867 867 102 102 338 534 1311 954 1143 249 486 75 408
# [76] 573 687 570 570 570 570 570 570 570 570 570 570 570 570 570
# [91] 570 570 570 570 570
table(getLength(cancer_transcript))
#
# 75 99 102 108 126 138 210 213 249 267 278 320 338 342 402 408
# 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
# 411 465 468 486 507 534 543 570 573 633 687 729 756 795 861 867
# 2 3 1 1 1 1 3 18 1 2 1 2 1 1 1 12
# 900 954 975 1143 1230 1245 1311 1473 1578 1647 1674 1707 1821 1890 1926 1974
# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# 2031 2073 2103 2142 2151 2169 2334 2454 2475 2520
# 1 1 1 2 1 1 1 1 1 1
lengths <- table(getLength(cancer_transcript))
barplot(lengths,xlab = "Cancer Lranscripts Lengths",ylab = "Frequency")

3.2 GC含量火山圖
cancer_tsGC <- unlist(lapply(cancer_transcript, GC))
hist(cancer_tsGC)

以為這就學(xué)會(huì)惹,這時(shí)作者出現(xiàn)在墻角: "ce n'est que le premier pas qui co?te."
It is only the first step that costs.
References
- Version 3.1-5 of the seqinR manual http://seqinr.r-forge.r-project.org/seqinr_3_1-5.pdf
- DAVE TANG'S BLOG Using the R SeqinR package https://davetang.org/muse/2013/05/09/using-the-r-seqinr-package/
- DNA Sequence Statistics (1) https://a-little-book-of-r-for-bioinformatics.readthedocs.io/en/latest/src/chapter1.html#gc-content-of-dna
最后,向大家隆重推薦生信技能樹的一系列干貨!
- 生信技能樹全球公益巡講:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
- B站公益74小時(shí)生信工程師教學(xué)視頻合輯:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
- 招學(xué)徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
