Introduction to Bioconductor for Sequence Data
在院士組輪轉(zhuǎn)的時候,由于沒人安排我去做什么,我也不懂怎么和別人交流,于是那些日子就在翻譯Biocondutor的教程中度過。本文為我學(xué)習(xí)Bioconductor所寫的第一篇學(xué)習(xí)筆記。說是學(xué)習(xí)筆記,差不多等于全文翻譯了bioconductor的 sequence Analysis一章?;旧细咄繑?shù)據(jù)分析到了最后都會用到Bioconductor的包,甚至是前期都可能會用到,
什么是Bioconductor
官方的簡介是:
Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, 1296 software packages, and an active user community. Bioconductor is also available as an AMI (Amazon Machine Image) and a series of Docker images.
簡單的說Bioconductor就是以R語言為平臺的一個高通量基因組數(shù)據(jù)的分析工具。那么下面就算正文環(huán)節(jié)
簡介
Biconductor為高通量基因組數(shù)據(jù)提供了分析和理解方法。我們擁有大量包可對大量數(shù)據(jù)進(jìn)行嚴(yán)格的統(tǒng)計分析,while keeping techological aritiacts in mind. Bioconductor能進(jìn)行多種類型分析:
- 測序: RNASeq, ChIPSeq, variants, copy number...
- 微矩陣: 表達(dá),SNP
- Domain specific analysis: Flow cytometry, Proteomics...
對于這些分析,可用簡單的導(dǎo)入并使用多種序列相關(guān)文件類型,包括,fasta, fastq,BAM,gtf,bed,和wig文件等等.Bionconductor支持導(dǎo)入,常見和高級的序列操作,triming,transformation, and alignment(質(zhì)量評估)
Sequencing Resources
不同階段,不同數(shù)據(jù)格式會用到的包

- IRange, GenomicRanges:基于range的計算,數(shù)據(jù)操作,常見數(shù)據(jù)表示。Biostring: DNA和氨基酸序列表示,比對和模式匹配和大規(guī)模生物學(xué)序列的數(shù)據(jù)操作。ShortRead處理FASTQ文件
- Rsamtools, GenomicAlignments用于比對read(BAM文件)的I/O和數(shù)據(jù)操作 rtracklayer用于導(dǎo)入和導(dǎo)出多種數(shù)據(jù)格式(BED,WIG,bigWig,GTF,GFF)和UCSC genome browser tracks操作
- BSgenomes: 用于獲取和操作規(guī)劃的全基因組。GenomicFeatures常見基因組之間序列特征注釋,biomaRt 獲取Biomart數(shù)據(jù)庫
- SRAdb用于從Sequnce Read Archive(SRA)查找和獲取數(shù)據(jù)。
Bioconductor包通過biocViews進(jìn)行管理,有些詞條位于Sequencing,其他則在其他條目和代表性的包下,包括
ChIPSeq, e.g.,DiffBind, csaw, ChIPseeker, ChIPQC.
SNPs and other variants, e.g., VariantAnnotation, VariantFiltering, h5vc.
CopyNumberVariation e.g., DNAcopy, crlmm, fastseg.
Microbiome and metagenome sequencing, e.g., metagenomeSeq, phyloseq, DirichletMultinomial.
Ranges Infrastructure
許多其他的Bioconductor包高度依賴于IRanges/GenomicRanges所提供的低層數(shù)據(jù)結(jié)構(gòu)
library(GenomicRanges)
GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 3, 4)),
IRanges(1:10, width=5), strand='-',
score=101:110, GC = runif(10))
GenomicRanges使得我們可以將染色體坐標(biāo)的一個范圍和一段序列名(如chromosome)以及正負(fù)鏈關(guān)聯(lián)起來。這對描述數(shù)據(jù)(aligned reads坐標(biāo),called ChIP peaks 或copy number variants)和注釋(gene models, Roadmap Epigenomics regularoty elements, dbSNP的已知臨床相關(guān)變異)
GRanes對象,用來儲存基因組位置和關(guān)聯(lián)注釋的向量。
Genomic ranges基本都是通過導(dǎo)入數(shù)據(jù) (e.g., viaGenomicAlignments::readGAlignments())或注釋(e.g., via GenomicFeatures::select() or rtracklayer::import() of BED, WIG, GTF, and other common file formats).進(jìn)行創(chuàng)建
help(package="GenomicRanges")
vignette(package="GenomicRanges")
GRanges常用操作包括findOverlaps, nearest等
FASTA文件中的DNA/氨基酸序列
Biostrings類用于表示DNA或氨基酸序列
library(Biostrings)
d <- DNAString("TTGAAAA-CTC-N")
length(d)
使用AnnotationHub在Ensembl上下載Homo sapiens cDNA序列,文件名為‘Homo_sapiens.GRCh38.cdna.all.fa’
library(AnnotationHub)
ah <- AnnotationHub()
ah2 <- query(ah, c("fasta","homo sapiens","Ensembl"))
fa <- ah2[["AH18522]]
fa
使用Rsamtools打開fasta文件讀取里面的序列和寬度記錄
library(Rsamtools)
idx <- scanFaIndex(fa)
idx
long <- idx[width(idx)>82000]
getSeq(fa,param=long)
BSgenome提供全基因組序列,
library(BSgenome.Hsapiens.UCSC.hg19)
chr14_range <- GRanges("chr14",IRanges(1,seqlengths(HSapiens)["chr14"]))
chr12_dna <- getSeq(Hsapiens,chr14_range)
letterFrequency(chr14_dna,‘GC",as.prob=TRUE)
FASTQ文件中的Reads
ShortRead用于fastq文件.BiocParallel用于加速任務(wù)進(jìn)程
##1. attach ShortRead and BiocParallel
library(ShortRead)
library(BiocParallel)
## 2. create a vectior of file paths
fls <- dir("~/fastq",pattern="*fastq",full=TRUE)
## 3. collect statistics
stats0 <- qa(fls)
## 4. generate and browser the report
if (interactive())
browseURL(report(stats))
BAM中的Aligned Reads
GenomicAlignments用于輸入比對到參考基因組的reads
下一個案例中,我們會讀入一個BAM文件,and specifically read in reads / supporting an apparent exon splice junction/ spanning position 19653773 of chromosome 14.
RNAseqData.HNRNPC.bam.chr14_BAMFILES內(nèi)有8個BAM文件。我們只是用第一個BAM文件。我們會載入軟件包和數(shù)據(jù)包,為目標(biāo)區(qū)域構(gòu)建一個GRanges對象,然后使用summarizeJunctions()尋找目標(biāo)區(qū)域的read
# 1. 載入軟件包
library(GenomicRanges)
library(GenomicAlignments)
# 2. 載入樣本數(shù)據(jù)
library('RNAseqDta.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1),asMates=TRUE)
# 3. 定義目標(biāo)區(qū)域
roi <- GRanges("chr14", IRanges(19653773,width=1))
# 4. alignments, junctons, overlapping our roi
paln <- readAlignmentsList(bf)
j <- summarizeJunctions(paln,with.revmap=TRUE)
j_overlap <- j[j %over% roi]
# 5.supporting reads
paln[j_overlap$revmap[[1]]]
Called Variants from VCF files
VCF(Variant Call Files)描述了SNP和其他變異。該文件包括元信息行,含有列名的header line和眾多的數(shù)據(jù)行,每一行都包括基因組上位置信息和每個位置上可選基因型信息。
數(shù)據(jù)通過VariantAnnotation的readVcf()讀入
library(VariantAnnotation)
fl <- system.file("extdata","chr22.vcf.gz",package=“VariantAnnotation")
vcf <- readVcf(fl,"hg19")
Genome Annotations from BED, WIG, GTF etc files
rtracklayer能夠?qū)牒蛯?dǎo)出許多常見的文件類型,BED,WIG,GTF,還能查詢和導(dǎo)航UCSC genome Browser
rtracklayer包含測序所用BED文件
library(rtracklayer)
test_path <- system.file("test",package = "rtracklayer")
test_bed <- file.path(test_path,"test.bed")
test <- import(test_bed,format="bed")
test
AnnotationHub 同樣包括多種基因組注釋文件(BED,GTF,BigWig),使用rtracklayer import()。更多有關(guān)AnnotationHub的介紹需要看 Annotation workflow and AnnotationHub HOW TO vignette。
順便放一下自己的知識星球,如果你覺得我對你有幫助的話。
