Reposted from: TCGA Data Download: TCGAbiolinks Package Parameters Explained
Original article by hls, 組學大講堂 (Omicsclass), 2019-10-22
Install TCGAbiolinks
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
BiocManager::install("TCGAbiolinks")
TCGAbiolinks download workflow
1. GDCquery()     # query the data
2. GDCdownload()  # download the data
3. GDCprepare()   # prepare the data
## Documentation: http://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html
GDCquery parameters
1. project
getGDCprojects()$project_id returns the current list of project IDs, including those of the different TCGA cancer types.
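As a small illustration of picking project IDs out of that list, the sketch below filters IDs by prefix. The helper names filter_project_ids and list_tcga_projects are made up for this example; only getGDCprojects() and its project_id column come from the text above, and the real call needs network access, so the demonstration runs on a hand-written vector.

```r
# Hypothetical helper: keep only the project IDs that start with a given prefix.
filter_project_ids <- function(project_ids, prefix = "TCGA-") {
  sort(project_ids[startsWith(project_ids, prefix)])
}

# Hypothetical wrapper around the call described above (needs network access).
list_tcga_projects <- function() {
  filter_project_ids(TCGAbiolinks::getGDCprojects()$project_id)
}

# Offline demonstration with a hand-written vector of IDs taken from this article.
example_ids <- c("TCGA-GBM", "TARGET-AML", "TCGA-ACC", "TCGA-ESCA")
tcga_only <- filter_project_ids(example_ids)
print(tcga_only)
```

With the package installed and a network connection, list_tcga_projects() would return the TCGA subset of the live project list.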
2. data.category
TCGAbiolinks:::getProjectSummary(project) shows which data categories a project contains, for example for "TCGA-ESCA".
Example:
TCGAbiolinks:::getProjectSummary('TCGA-ESCA')
$file_count
[1] 5657

$data_categories
  file_count case_count               data_category
1        919        184     Transcriptome Profiling
2       1486        184 Simple Nucleotide Variation
3        962        185                 Biospecimen
4        207        185                    Clinical
5        202        185             DNA Methylation
6       1115        185       Copy Number Variation
7        766        185            Sequencing Reads

$case_count
[1] 185

$file_size
[1] 8.198261e+12
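To make the structure of that summary concrete, here is a minimal sketch that rebuilds the $data_categories table by hand from the numbers printed above and works with it as an ordinary data frame. It assumes a live summary exposes the same three columns; the variable names are chosen only for this example.

```r
# Values copied from the getProjectSummary('TCGA-ESCA') output shown above.
data_categories <- data.frame(
  file_count = c(919, 1486, 962, 207, 202, 1115, 766),
  case_count = c(184, 184, 185, 185, 185, 185, 185),
  data_category = c("Transcriptome Profiling", "Simple Nucleotide Variation",
                    "Biospecimen", "Clinical", "DNA Methylation",
                    "Copy Number Variation", "Sequencing Reads"),
  stringsAsFactors = FALSE
)

# The per-category file counts add up to the reported $file_count.
total_files <- sum(data_categories$file_count)

# Category names, largest category first; these are the strings that
# would be passed to the data.category argument of GDCquery().
by_size <- data_categories[order(-data_categories$file_count), ]
print(by_size$data_category)
```

The sum of the seven file counts equals the 5657 reported in $file_count, which is a quick sanity check that no category was missed.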
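3. data.type
This parameter depends on the previous one: each data.category has its own set of data.type values.
4. workflow.type
This parameter depends on the two previous ones: each combination of data.category and data.type has its own set of workflow.type values, as listed in the table at https://www.omicsclass.com/article/1059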
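One way to see which combinations actually exist is to query a whole data.category and tabulate what comes back. The sketch below assumes that the table returned by getResults() has columns named data_type and analysis_workflow_type; both column names and both helper names are assumptions made for this example, so check colnames() on a real result first. The demonstration uses a hand-made table filled with values that appear in the examples of this article.

```r
# Hypothetical helper: list the distinct data.type / workflow.type pairs in a
# results table. The default column names are assumptions about getResults().
distinct_types <- function(results,
                           columns = c("data_type", "analysis_workflow_type")) {
  present <- intersect(columns, colnames(results))
  combos <- unique(results[, present, drop = FALSE])
  rownames(combos) <- NULL
  combos
}

# Hypothetical wrapper (needs network access): query one category without
# data.type or workflow.type, then summarise what the query found.
available_types <- function(project, data.category) {
  query <- TCGAbiolinks::GDCquery(project = project,
                                  data.category = data.category)
  distinct_types(TCGAbiolinks::getResults(query))
}

# Offline demonstration on a hand-made table (illustrative rows only).
mock_results <- data.frame(
  data_type = c("Gene Expression Quantification",
                "Gene Expression Quantification",
                "miRNA Expression Quantification"),
  analysis_workflow_type = c("HTSeq - Counts", "HTSeq - Counts",
                             "BCGSC miRNA Profiling"),
  stringsAsFactors = FALSE
)
type_table <- distinct_types(mock_results)
print(type_table)
```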
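5. legacy
This parameter selects which of the two TCGA download entry points to use: the GDC Legacy Archive or the GDC Data Portal. According to the official explanation of Legacy versus Harmonized data, Legacy data use hg19 and hg18 as the reference genome (older data) and are no longer updated, whereas Harmonized data use hg38 as the reference genome (newer data). Harmonized is now the usual choice. The parameter is set to TRUE (Legacy) or FALSE (Harmonized).
6. access
Filter by access type. Possible values: controlled, open. This filters on whether the data are openly accessible. It rarely needs any thought: controlled data are generally not needed, so simply set access = "open".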
7. platform
The platform that produced the data, for example a particular array or methylation platform. It is usually left unset unless you need data from one specific platform.
8. file.type
Used when downloading from the GDC Legacy Archive (legacy = TRUE); see the official documentation: http://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html
When downloading from the GDC Data Portal, this parameter does not need to be set.
9. barcode
A list of barcodes to filter the files to download. Use it to specify which samples to download, for example:
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
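Since the barcode encodes which patient, sample and aliquot a file belongs to, it can help to split it into its fields before building a filter. The following is a minimal sketch for full seven-field TCGA aliquot barcodes like the two above; the function name parse_tcga_barcode is hypothetical, and shorter barcodes (such as the TARGET ones used later in this article) are deliberately rejected.

```r
# Hypothetical helper: split full TCGA aliquot barcodes into their fields.
# Layout: TCGA-TSS-Participant-SampleVial-PortionAnalyte-Plate-Center
parse_tcga_barcode <- function(barcodes) {
  parts <- strsplit(barcodes, "-", fixed = TRUE)
  # Only full aliquot barcodes with seven dash-separated fields are handled.
  stopifnot(all(lengths(parts) == 7))
  fields <- do.call(rbind, parts)
  data.frame(
    barcode     = barcodes,
    tss         = fields[, 2],                 # tissue source site
    participant = fields[, 3],
    sample      = substr(fields[, 4], 1, 2),   # sample type code, e.g. "01"
    vial        = substr(fields[, 4], 3, 3),
    portion     = substr(fields[, 5], 1, 2),
    analyte     = substr(fields[, 5], 3, 3),   # e.g. "R" for RNA, "D" for DNA
    plate       = fields[, 6],
    center      = fields[, 7],
    stringsAsFactors = FALSE
  )
}

parsed <- parse_tcga_barcode(c("TCGA-14-0736-02A-01R-2005-01",
                               "TCGA-06-0211-02A-02R-2005-01"))
print(parsed)
```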
10. data.format
Selects files by format. The possible values are: ("VCF", "TXT", "BAM", "SVS", "BCR XML", "BCR SSF XML", "TSV", "BCR Auxiliary XML", "BCR OMF XML", "BCR Biotab", "MAF", "BCR PPS XML", "XLSX"). It usually does not need to be set; the default is fine.
11. experimental.strategy
Filters the data by the experimental method that produced them:
Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array.
Legacy: WXS, RNA-Seq, miRNA-Seq, Genotyping Array, DNA-Seq, Methylation array, Protein expression array, CGH array, VALIDATION, Gene expression array, WGS, MSI-Mono-Dinucleotide Assay, miRNA expression array, Mixed strategies, AMPLICON, Exon array, Total RNA-Seq, Capillary sequencing, Bisulfite-Seq
12. sample.type
Filters by sample type, for example primary tumor tissue, recurrent tumor, and so on.
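The sample type is also visible in the barcode itself: characters 14 and 15 of a TCGA barcode hold a two-digit sample type code. The sketch below looks that code up in a small table covering a few common codes only. The names sample_type_codes and sample_type_of are made up for this example, and the exact spelling that the sample.type argument accepts may differ from these labels, so treat the labels as descriptive rather than as valid argument values.

```r
# A few common TCGA sample type codes (not the complete code table).
sample_type_codes <- c(
  "01" = "Primary Solid Tumor",
  "02" = "Recurrent Solid Tumor",
  "06" = "Metastatic",
  "10" = "Blood Derived Normal",
  "11" = "Solid Tissue Normal"
)

# Hypothetical helper: read the sample type code from positions 14-15 of each
# barcode and translate it; codes missing from the table come back as NA.
sample_type_of <- function(barcodes) {
  codes <- substr(barcodes, 14, 15)
  unname(sample_type_codes[codes])
}

# Barcodes taken from the examples in this article.
barcodes <- c("TCGA-14-0736-02A-01R-2005-01",
              "TCGA-06-0211-02A-02R-2005-01",
              "TCGA-OR-A5LR-01A-11D-A29H-01")
types <- sample_type_of(barcodes)
print(table(types))
```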
With all the parameters covered, here are some usage examples:
query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment")

## Not run:
query <- GDCquery(project = "TARGET-AML",
                  data.category = "Transcriptome Profiling",
                  data.type = "miRNA Expression Quantification",
                  workflow.type = "BCGSC miRNA Profiling",
                  barcode = c("TARGET-20-PARUDL-03A-01R", "TARGET-20-PASRRB-03A-01R"))

query <- GDCquery(project = "TARGET-AML",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  barcode = c("TARGET-20-PADZCG-04A-01R", "TARGET-20-PARJCR-09A-01R"))

query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy Number Variation",
                  data.type = "Masked Copy Number Segment",
                  sample.type = c("Primary solid Tumor"))

query.met <- GDCquery(project = c("TCGA-GBM", "TCGA-LGG"),
                      legacy = TRUE,
                      data.category = "DNA methylation",
                      platform = "Illumina Human Methylation 450")

query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy number variation",
                  legacy = TRUE,
                  file.type = "hg19.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01"))
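Downloading data: GDCdownload()
Once the GDCquery() call above has finished, the data can be downloaded with GDCdownload(). If there is a lot of data and the download is interrupted, run GDCdownload() again to continue, and repeat until everything has been downloaded. For example: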
query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq",
                  file.type = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                  legacy = TRUE)
# files.per.chunk is documented for the API download method, so method = "api" is used here
GDCdownload(query, method = "api", files.per.chunk = 10, directory = "D:/data")
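The main parameters are: query, the result returned by GDCquery(); files.per.chunk = 10, the number of files downloaded per chunk (per the package documentation this applies to method = "api"; use a smaller value on a slow connection); and directory = "D:/data", the path where the data are stored.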
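The advice to rerun GDCdownload() after an interruption can be automated with a small retry wrapper. This is a sketch in base R, not something TCGAbiolinks provides: the function name retry and its arguments are made up for this example, and it assumes that a failed download surfaces as an ordinary R error.

```r
# Hypothetical helper: call f() until it succeeds or max_attempts is reached.
retry <- function(f, max_attempts = 3, wait_seconds = 0) {
  last_error <- NULL
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      list(ok = TRUE, value = f()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) {
      return(invisible(result$value))
    }
    last_error <- result$error
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_attempts, conditionMessage(last_error)))
    if (attempt < max_attempts) Sys.sleep(wait_seconds)
  }
  # Every attempt failed: re-signal the last error to the caller.
  stop(last_error)
}

# Intended use with the query from this section (needs network access):
# retry(function() GDCdownload(query, files.per.chunk = 10, directory = "D:/data"),
#       max_attempts = 5, wait_seconds = 30)
```

Keeping the directory argument the same on every attempt matters here, since the point of rerunning is to continue from the files that are already on disk.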
Preparing data: GDCprepare()
GDCprepare() reads the downloaded files and assembles the gene expression data for us automatically:
data <- GDCprepare(query = query,
                   save = TRUE,
                   # must match the directory passed to GDCdownload, otherwise
                   # GDCprepare cannot find the downloaded files
                   directory = "D:/data",
                   # save the result so it can be loaded directly next time
                   save.filename = "GBM.RData")
Once you have the data object, you can move on to the downstream data mining.
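In a later session the saved file can be read back instead of repeating the download and preparation. The sketch below assumes the file is in the standard format written by R's save(), which is what load() reads; because the name of the stored object is not something this article states, the hypothetical helper load_saved loads into a private environment and returns whatever it finds as a named list.

```r
# Hypothetical helper: load an .RData file without touching the global
# environment and return its contents as a named list.
load_saved <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)
  mget(object_names, envir = env)
}

# Intended use with the file written by GDCprepare above:
# saved <- load_saved("GBM.RData")
# names(saved)   # shows which object(s) the file contains
```

Inspecting names(saved) first avoids guessing the object name, and nothing in the current workspace is overwritten by accident.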