上文介紹的GenomicFeatures包在統(tǒng)一方便的同時(shí),也犧牲掉了靈活性。這里介紹另外一個(gè)R包rtracklayer,提供GTF/GFF,BED,BED Graph文件的讀取,處理與導(dǎo)出。
本節(jié)數(shù)據(jù)下載地址: https://github.com/vsbuffalo/bds-files/tree/master/chapter-09-working-with-range-data
數(shù)據(jù)讀取
讀取采用import函數(shù),可將上述文件類型讀取為GenomicRanges對(duì)象,而且自動(dòng)地分配meta-colmns列:
> library(rtracklayer)
> mm_gtf <- import("Mus_musculus.GRCm38.75_chr1.gtf.gz")
> names(mcols(mm_gtf))
[1] "source" "type" "score"
[4] "phase" "gene_id" "gene_name"
[7] "gene_source" "gene_biotype" "transcript_id"
[10] "transcript_name" "transcript_source" "tag"
[13] "exon_number" "exon_id" "ccds_id"
[16] "protein_id"
數(shù)據(jù)導(dǎo)出
采用export函數(shù)導(dǎo)出,同樣可以導(dǎo)出為上述的文件類型。例如提取5個(gè)pseudogene并導(dǎo)出為gtf文件,可以:
> set.seed(0)
> pseudogene_i <- which(mm_gtf$gene_biotype == "pseudogene")
> sample_i <- sample(pseudogene_i, 5)
> pseudo_gene_sample <- mm_gtf[sample_i, ]
> export(pseudo_gene_sample, con = "pseudo_gene_sample.gtf", format = "GTF")
假如我們并不關(guān)心meta-column列的數(shù)據(jù),可以將它們刪除,保存為BED格式(BED文件僅包含染色體編號(hào),起始與結(jié)束位點(diǎn)三列數(shù)據(jù)):
> pseudo_gene_bed <- pseudo_gene_sample
> mcols(pseudo_gene_bed) <- NULL
> pseudo_gene_bed
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 1 88153788-88154571 +
[2] 1 55443733-55445064 -
[3] 1 121969217-121970524 -
[4] 1 85332716-85342086 -
[5] 1 119787769-119788086 +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> export(pseudo_gene_bed, con = "pseudo_gene_bed.bed", format="BED")
最后補(bǔ)充一下,rtracklayer提供和UCSC瀏覽器的交互,可以將GenomicRanges對(duì)象生成為UCSC的track并上傳到UCSC網(wǎng)頁,需要的話可以閱讀一下官方文檔了解一下。