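There are several ways to download data from the TCGA database. The common ones are:
1. Download directly from the TCGA website (fine for small datasets, impractical for large ones)
2. Download with gdc-client, the official TCGA tool
3. Download with the TCGAbiolinks package in R
When I downloaded clinical data earlier, I started with gdc-client. Later, when I tried to download expression data, the connection kept dropping, and after several days the download still had not finished, so I switched to TCGAbiolinks. It turned out to be a real gem of a package: the files I had failed to download in several days came down in minutes.
The following uses the TCGAbiolinks package to download the data.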
> library(TCGAbiolinks)
> setwd("D:/breast_cancer/TCGA/Biolinks/expFPKM")
> query <- GDCquery(project = "TCGA-BRCA",
+ legacy = FALSE, # FALSE (the default) downloads hg38 data; TRUE downloads hg19 data
+ experimental.strategy = "RNA-Seq",
+ data.category = "Transcriptome Profiling",
+ data.type = "Gene Expression Quantification",
+ workflow.type = "HTSeq - FPKM")
# Output like the following means the query succeeded; otherwise, run the query again
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-BRCA
--------------------
oo Filtering results
--------------------
ooo By experimental.strategy
ooo By data.type
ooo By workflow.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
# Once the query has succeeded, start downloading the data
> GDCdownload(query)
# Output of a successful download
Downloading data for project TCGA-BRCA
GDCdownload will download 1222 files. A total of 635.710654 MB
Downloading as: Fri_Apr_16_11_16_59_2021.tar.gz
Downloading: 640 MB
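Since dropped connections were the original problem, one option is to split the download into smaller chunks and then load the files into R. The sketch below is a minimal, hypothetical wrapper (the name download_and_prepare and its defaults are my own, not part of TCGAbiolinks). It assumes TCGAbiolinks is installed and that `query` is the object returned by GDCquery above. It relies on the `files.per.chunk` and `directory` arguments of GDCdownload and on GDCprepare; how well chunking helps will depend on your network.

```r
# Hypothetical helper: download the files of a GDC query in small chunks,
# then load them into R. Nothing is downloaded until the function is called.
download_and_prepare <- function(query, chunk_size = 10, data_dir = "GDCdata") {
  if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
    stop("TCGAbiolinks is not installed")
  }
  # Smaller chunks mean less data to re-fetch if one request fails
  TCGAbiolinks::GDCdownload(query,
                            method = "api",
                            directory = data_dir,
                            files.per.chunk = chunk_size)
  # Read the downloaded files into a single R object
  TCGAbiolinks::GDCprepare(query, directory = data_dir)
}

# Usage (requires network access, so it is left commented out here):
# se <- download_and_prepare(query)
```

For this kind of expression data the returned object is normally a SummarizedExperiment, so the expression matrix can then be extracted with SummarizedExperiment::assay(se).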
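Supplement:
At the time of this download (April 2021), workflow.type had three options:
HTSeq - FPKM-UQ: FPKM values normalized by the upper quartile
HTSeq - FPKM: FPKM values (expression levels normalized for gene length and sequencing depth)
HTSeq - Counts: raw read counts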
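To make the difference between the three types concrete, here is a small worked example with made-up numbers. It is a simplified sketch: the helper names fpkm and fpkm_uq, the gene names, and the gene lengths are all hypothetical, and the denominators are computed over every gene in the toy vector, whereas the GDC pipeline defines them over a specific gene set. FPKM divides by gene length and the total read count; FPKM-UQ swaps the total for the 75th-percentile count.

```r
# Toy raw counts and gene lengths (made-up values)
counts     <- c(geneA = 100, geneB = 200, geneC = 300, geneD = 400)
lengths_bp <- c(geneA = 1000, geneB = 2000, geneC = 1000, geneD = 4000)

# FPKM: fragments per kilobase of gene per million mapped fragments
fpkm <- function(counts, gene_length_bp, total_reads = sum(counts)) {
  counts * 1e9 / (gene_length_bp * total_reads)
}

# FPKM-UQ: same formula, but scaled by the upper-quartile count
fpkm_uq <- function(counts, gene_length_bp) {
  upper_quartile <- unname(quantile(counts, 0.75))
  counts * 1e9 / (gene_length_bp * upper_quartile)
}

fpkm_values    <- fpkm(counts, lengths_bp)
fpkm_uq_values <- fpkm_uq(counts, lengths_bp)

print(fpkm_values)
print(fpkm_uq_values)
```

In this toy sample geneA, geneB and geneD end up with the same FPKM even though their raw counts differ, because the longer genes collect proportionally more reads.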
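Download the TCGA counts matrix from UCSC Xena
dat = read.table("counts.tsv.gz",check.names = F,row.names = 1,header = T) # check.names = F keeps the column names exactly as they are in the file; otherwise R rewrites them into syntactic names, e.g. turning the '-' in sample IDs into '.'
# Reverse the log transform (the values are stored as log2(count + 1))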
dat = as.matrix(2^dat - 1)
dat[1:4,1:4]
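as.character(dat[1:100,1:10]) # some values are not exact integers
# Convert to an integer matrix with apply; round first, because as.integer() truncates (6.9999999 would become 6)
exp = apply(dat, 2, function(x) as.integer(round(x)))
exp[1:4,1:4] # the row names are lost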
rownames(exp) = rownames(dat)
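The steps above can be tried end to end on a tiny made-up table. The sketch below writes a toy log2(count + 1) file to a temporary location (the gene IDs, sample IDs and counts are invented for illustration), reads it back with and without check.names = FALSE, and recovers the integer counts. Instead of apply it uses round() plus storage.mode(), which is one alternative that keeps the row and column names in place; it is not the only way to do this.

```r
# Made-up raw counts with TCGA-style sample IDs containing '-'
true_counts <- matrix(
  c(7L, 1024L, 4L, 0L, 1L, 20L), nrow = 3,
  dimnames = list(c("ENSG_A", "ENSG_B", "ENSG_C"),
                  c("TCGA-AA-0001-01A", "TCGA-BB-0002-01A"))
)

# Write the matrix in log2(count + 1) form to a temporary TSV file
toy_file <- tempfile(fileext = ".tsv")
write.table(log2(true_counts + 1), toy_file,
            sep = "\t", quote = FALSE, col.names = NA)

# Default behaviour: column names are made syntactic ('-' becomes '.')
mangled <- read.table(toy_file, header = TRUE, row.names = 1, sep = "\t")
print(colnames(mangled))

# check.names = FALSE keeps the sample IDs as written in the file
dat <- read.table(toy_file, header = TRUE, row.names = 1, sep = "\t",
                  check.names = FALSE)
print(colnames(dat))

# Reverse the log transform, round away floating-point noise,
# and switch the storage type to integer without losing dimnames
counts <- round(2^as.matrix(dat) - 1)
storage.mode(counts) <- "integer"
print(counts)

unlink(toy_file)
```

Because round() and storage.mode() both preserve the matrix attributes, there is no need to restore the row names afterwards.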