TCGA(The cancer genome atlas,癌癥基因組圖譜)由 National Cancer Institute(NCI,美國國家癌癥研究所) 和 National Human Genome Research Institute(NHGRI,美國國家人類基因組研究所)于 2006 年聯(lián)合啟動的項目, 收錄了各種人類癌癥(包括亞型在內(nèi)的腫瘤)的各種各樣的測序數(shù)據(jù),包括RNA sequencing,MicroRNA sequencing,DNA sequencing,SNP-based platforms,Array-based DNA methylation sequencing,Reverse-phase array(包含了基因組、轉(zhuǎn)錄組、蛋白組、表觀組各個組學(xué)數(shù)據(jù)),另外整合了臨床樣本信息Biospecimen以及Clinical。
官方網(wǎng)址:https://www.cancer.gov/ccg/research/genome-sequencing/tcga
GDC data portal:https://portal.gdc.cancer.gov/
注意: TCGA數(shù)據(jù)庫中的測序數(shù)據(jù)共分為四層:level1、level2、level3、level4。其中,level3、level4的數(shù)據(jù)(經(jīng)處理及標(biāo)準(zhǔn)化后數(shù)據(jù))一般都開放下載的,level1是最原始的數(shù)據(jù)(Fasta、Fastq等),level2是做了進一步的處理的(經(jīng)比對Bam文件),這些數(shù)據(jù)一般是不開放的,需要申請才能下載
1. 數(shù)據(jù)集介紹

★ "Repository"(存儲庫頁面)是訪問 GDC 數(shù)據(jù)門戶中數(shù)據(jù)的主要方法。它提供了 GDC 中可用的所有案例和文件的概覽,并為用戶提供了各種過濾器來識別和瀏覽感興趣的Cases(案例)和Files(文件)。
1.1
介紹
Files包含與數(shù)據(jù)文件和實驗策略相關(guān)的方面,
1.1.1
(數(shù)據(jù)類別)
? simple nucleotide variation:簡單核苷酸編譯
? sequencing reads:測序讀取
? copy number variation:拷貝數(shù)變異
? structural variation:結(jié)構(gòu)變異
? transcriptome profiling:轉(zhuǎn)錄組分析
? biospecimen:生物樣本
? dna methylation:DNA甲基化
? clinical:臨床
? somatic structural variation:體細(xì)胞結(jié)構(gòu)變異
? proteome profiling:蛋白組分析
? combined nucleotide variation: 核苷酸組合變異
1.1.2
(數(shù)據(jù)類型)
? Annotated Somatic Mutation:體細(xì)胞突變的注釋文件,格式為VCF,采用VEP軟件進行注釋,文件后綴為vep.vcf.gz
? Aligned Reads:對齊讀取
? Raw Simple Somatic Mutation: 體細(xì)胞突變的原始文件,格式為VCF,文件后綴為vcf.gz
? Masked Somatic Mutation:open的突變注釋文件,免費下載的,格式為MAF,文件后綴為maf.gz
? Aggregated Somatic Mutation:controlled的突變注釋文件,需要賬號和權(quán)限才可以下載,格式為MAF,文件后綴為maf.gz
? Gene Level Copy Number Scores:基因水平拷貝數(shù)分?jǐn)?shù)
? Gene Expression Quantification:基因表達量化
等
1.1.3
(實驗策略)
? WXS:全外顯子組測序,主要用來檢測體細(xì)胞突變(SNP),提供了四種方法(MuSE、MuTect2、VarScan2、SomaticSniper)來分析SNP,并分別提供了結(jié)果
? RNA-Seq:全轉(zhuǎn)錄組測序,包含lncRNA、mRNA、假基因等等。
? Targeted Sequencing:靶向測序
? Genotyping Array:基因分型陣列
? miRNA-Seq:miRNA測序(miRNA是一類由內(nèi)源基因編碼的長度約為22個核苷酸的非編碼單鏈RNA分子,生物中非常重要的一類非編碼小RNA,其在生物體的調(diào)控中具有非常重要的作用,在人中大約三分之一的基因受到miRNA的調(diào)控),數(shù)據(jù)只有一種 - BCGSC
? Methylation Array:甲基化數(shù)據(jù)(DNA甲基化能引起染色質(zhì)結(jié)構(gòu)、DNA構(gòu)象、DNA穩(wěn)定性及DNA與蛋白質(zhì)相互作用方式的改變,從而控制基因表達),TCGA提供的甲基化芯片數(shù)據(jù),主要以CpG位點為單位,一般認(rèn)為在基因啟動子區(qū)域上(基因的TSS的上游2kb到下游500bp之間)的甲基化對該基因的表達會產(chǎn)生影響;目前TCGA提供的公開下載的甲基化數(shù)據(jù)主要為level3的CpG位點(DNA序列上堿基為C或者G的位點,一般公認(rèn)的甲基化只會發(fā)生在CpG位點上)的甲基化水平的數(shù)據(jù)
? WGS:全基因組測序
等
1.1.4
(工作流類型)
? DNAcopy:DNA拷貝
? GENIE Simple Somatic Mutation:簡單體細(xì)胞突變
? GENIE Copy Number Variation:拷貝值變化
? BCGSC miRNA Profiling:miRNA分析
等
1.1.5
(數(shù)據(jù)格式)
略
1.1.6
(平臺)
略
1.1.7
? controlled:需要申請賬號才可以下載
? open:開放的數(shù)據(jù)
1.2.
介紹
Cases卡包含與案例和生物樣本信息相關(guān)的方面
1.2.1
(案例編號)
略
1.2.2
(原發(fā)部位)
? bronchus and lung:支氣管和肺
? breast:乳腺
? hematopoietic and reticuloendothelial systems:造血和網(wǎng)狀內(nèi)皮系統(tǒng)
? colon:結(jié)腸
? ovary:卵巢
等
1.2.3
(不同數(shù)據(jù)集)
? GENIE
? FM
? TCGA
? TARGET
等
1.2.4
(項目)
? FM-AD
? GENIE-MSK
? GENIE-DFCI
? GENIE-MDA
? GENIE-JHU
等
1.2.5
(疾病類型)
? adenomas and adenocarcinomas:腺癌
? ductal and lobular neoplasms:導(dǎo)管和小葉腫瘤
? epithelial neoplasms, nos:上皮性腫瘤
? squamous cell neoplasms:鱗狀細(xì)胞腫瘤
? gliomas:神經(jīng)膠質(zhì)瘤
等
1.2.6
(性別)
? female:女性
? male:男性
? unknown:未知
? not reported:未報導(dǎo)
? unspecified:不明確
1.2.7
(診斷年齡)
略
1.2.8
(生存狀態(tài))
? not reported:未報導(dǎo)
? alive:存活
? dead:死亡
? unknown:未知
1.2.9
(死亡天數(shù))
略
1.2.10
(人種)
略
1.2.11
(種族)
? not hispanic or latino:不是西班牙裔或拉丁裔
? not reported:未報道
? unknown:未知
? hispanic or latino:西班牙裔或拉丁裔
2. 數(shù)據(jù)集下載
2.1

? Cases:選擇"Project - TCGA-LUAD(肺腺癌)"
? Files:選擇"Data Category - transcriptome profiling","Experimental Strategy - RNA-Seq","Workflow Type - STAR-Counts","Access - open"
? 最終得倒Manifest文件:gdc_manifest_20230608_063108.txt
2.2 下載
工具
TCGA的全部數(shù)據(jù)都存儲在GDC的Data Portal中,如果想要下載大量數(shù)據(jù),經(jīng)常需要使用到gdc官方下載工具:gdc-client
gdc-client工具網(wǎng)址:GDC Data Transfer Tool | NCI Genomic Data Commons (cancer.gov)
$ wget https://gdc.cancer.gov/files/public/file/gdc-client_v1.6.1_Ubuntu_x64.zip
$ unzip gdc-client_v1.6.1_Ubuntu_x64.zip # 解壓
$ vim ~/.bashrc # 修改.bashrc文件
# 在最后一行添上:
export PATH=you gdc-client path:$PATH
$ source ~/.bashr c# 激活.bashrc文件
$ gdc-client --help # 檢測是否安裝配置成功
## usage: gdc-client [-h] [--version] {download,upload,settings} ...
##
## The Genomic Data Commons Command Line Client
##
## optional arguments:
## -h, --help show this help message and exit
## --version show program's version number and exit
##
## commands:
## {download,upload,settings}
## for more information, specify -h after a command
## download download data from the GDC
## upload upload data to the GDC
## settings display default settings
gdc-client download -m gdc_manifest_20230608_063108.txt # 下載數(shù)據(jù)

? 最終得到600個樣本(紅框,對應(yīng)600個文件夾)的表達矩陣(綠框 ".tsv文件")
? Metadata下載json文件即可
2.3
# 讀取metadata文件,用于associated_entities和file_name的轉(zhuǎn)換
library("rjson")
json <- jsonlite::fromJSON("/data/shumin/Cibersort/LUAD/metadata.cart.2023-06-08.json")
View(json)
sample_id <- sapply(json$associated_entities, function(x){x[,1]})
file_sample <- data.frame(sample_id, file_name = json$file_name)
# 獲取count文件夾下的所有tsv表達文件的 路徑+文件名
count_file <- list.files('counts/', pattern = '*.tsv', recursive = TRUE)
#在count_file中分割出文件名
count_file_name <- strsplit(count_file, split = '/')
count_file_name <- sapply(count_file_name, function(x){x[2]})
matrix = data.frame(matrix(nrow = 60660, ncol = 0))
for (i in 1:length(count_file)){
path = paste0('counts/', count_file[i])
data <- read.delim(path, fill = TRUE, header = FALSE, row.names = 1)
colnames(data) <- data[2, ]
data <- data[-c(1:6), ] # 去除前六行注釋信息
data <- data[3] #取出unstranded列(第3列),即count數(shù)據(jù),對應(yīng)其它數(shù)據(jù)
colnames(data) <- file_sample$sample_id[which(file_sample$file_name == count_file_name[i])]
matrix <- cbind(matrix, data)
}
write.csv(matrix, 'LUAD_Ensembl_COUNT_matrix.csv', row.names = TRUE) # 寫出文件,此時得到的gene id是"Ensembl ID"
# 設(shè)置Gene Symbol為列名
path = paste0('counts/', count_file[1])
data <- as.matrix(read.delim(path, fill = TRUE, header = FALSE, row.names = 1))
gene_name <- data[-c(1:6), 1]
matrix0 <- cbind(gene_name, matrix)
#將gene_name列去除重復(fù)的基因,保留每個基因最大表達量結(jié)果
matrix0 <- aggregate( . ~ gene_name, data = matrix0, max)
#將gene_name列設(shè)為行名
rownames(matrix0) <- matrix0[, 1]
matrix0 <- matrix0[, -1]
write.csv(matrix0, 'LUAD_SYMBOL_COUNT_matrix.csv', row.names = TRUE)
# 分為normal和tumor矩陣
sample <- colnames(matrix0)
normal <- c()
tumor <- c()
for (i in 1:length(sample)){
if((substring(colnames(matrix0)[i], 14, 15) >= 10)){ #14、15位置>=10的為normal樣本
normal <- append(normal, sample[i])
} else {
tumor <- append(tumor, sample[i])
}
}
tumor_matrix <- matrix0[, tumor]
normal_matrix <- matrix0[, normal]
3. 差異分析
# 構(gòu)建分組矩陣
All_sample <- cbind(tumor_matrix, normal_matrix)
# 將數(shù)據(jù)框里的數(shù)據(jù)從字符串型轉(zhuǎn)為數(shù)值型
All_sample_2 <- as.data.frame(lapply(All_sample, as.numeric))
row.names(All_sample_2) <- row.names(All_sample)
colnames(All_sample_2) <- colnames(All_sample)
# 過濾,原則:有原則是某基因在所有樣本中的read counts之和小于樣本總數(shù)(樣本均read counts<1)
All_sample_2 <- All_sample_2[which(rowSums(All_sample_2)/ncol(All_sample_2) >= 1),]
group_list <- c(rep('tumor', ncol(tumor_matrix)), rep('normal', ncol(normal_matrix)))
condition = factor(group_list)
coldata <- data.frame(row.names = colnames(All_sample_2), condition)
3.1
差異分析
library(DESeq2)
# 構(gòu)建 DESeqDataSet 對象并過濾低豐度數(shù)據(jù)
dds <- DESeqDataSetFromMatrix(countData = All_sample_2, colData = coldata, design = ~ condition)
keep <- rowSums(counts(dds) >1 ) >= 2
dds.keep <- dds[keep, ]
# 計算歸一化系數(shù)- sizeFactor
dds <- estimateSizeFactors(dds)
dds_norm <- DESeq(dds.keep)
# 提取結(jié)果
result_DESeq2 <- results(dds_norm, contrast = c("condition", "tumor", "normal"), name = 'condition_tumor_vs_normal', alpha = 0.05, lfcThreshold = 2, cooksCutoff = FALSE) # alpha = 0.05可指定padj; cookCutoff是不篩選outliers,因為太多了
summary(result_DESeq2)
##
## out of 36234 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 2.00 (up) : 1566, 4.3%
## LFC < -2.00 (down) : 318, 0.88%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
result <- as.data.frame(results(dds))
# 提取顯著差異表達基因的矩陣
DGE_DESeq2 <-subset(result, padj < 0.05 & (log2FoldChange > 2 | log2FoldChange < -2))
DGE_DESeq2 <- DGE_DESeq2[order(DGE_DESeq2$log2FoldChange), ]
# 輸出結(jié)果
resOrdered_DESeq2 <- result[order(result$padj, result$log2FoldChange, decreasing = c(FALSE, TRUE)), ] # 默認(rèn)decreasing = F
resOrdered_DESeq2[which(resOrdered_DESeq2$log2FoldChange >= 2 & resOrdered_DESeq2$padj < 0.05), 'sig'] <- 'up'
resOrdered_DESeq2[which(resOrdered_DESeq2$log2FoldChange <= -2 & resOrdered_DESeq2$padj < 0.05), 'sig'] <- 'down'
resOrdered_DESeq2[which(abs(resOrdered_DESeq2$log2FoldChange) <= 2 | resOrdered_DESeq2$padj >= 0.05), 'sig'] <- 'none'
write.table(resOrdered_DESeq2, 'DESeq2_all.txt', col.names = T, row.names = T, sep = '\t', quote = FALSE)
3.2
差異分析
library(edgeR)
# DEGList對象、構(gòu)建分組
degs <- DGEList(counts = All_sample_2, group = coldata$condition);degs
# 過濾,刪除低表達基因
keep <- rowSums(cpm(degs) > 1 ) >= 2
degs.keep <- degs[keep,]
# 歸一化
degs.norm <- calcNormFactors(degs.keep, method = 'TMM')
# 估計離散度
degs.norm <- estimateCommonDisp(degs.norm, verbose=TRUE)
degs.norm <- estimateTagwiseDisp(degs.norm)
# 通過檢驗差異來鑒別差異表達基因
et <- exactTest(degs.norm)
top <- topTags(et, 50) # 可以設(shè)置參數(shù)n來調(diào)整顯示的條目,默認(rèn)是n = 10;可以通過設(shè)置sort.by = "logFC"來按照|logFC|的降序排序;可以通過設(shè)置p.value = 0.05來篩選p-adjusted value < 0.05的基因
summary(de <- decideTestsDGE(et, adjust.method = "BH", p.value = 0.05, lfc = 2)) # 默認(rèn)的參數(shù)是adjust.method = "BH",也可以設(shè)置為adjust.method = "fdr";p.value = 0.05也是默認(rèn)的參數(shù),指的是篩選FDR小于0.05的基因;默認(rèn)的lfc為lfc=0
## tumor-normal
## Down 728
## NotSig 25748
## Up 5413
# 輸出結(jié)果
resOrdered_edgeR <- et[["table"]][order(et[["table"]]$PValue, et[["table"]]$logFC, decreasing = c(FALSE, TRUE)), ] # 默認(rèn)decreasing = F
resOrdered_edgeR[which(resOrdered_edgeR$logFC >= 2 & resOrdered_edgeR$PValue < 0.05), 'sig'] <- 'up'
resOrdered_edgeR[which(resOrdered_edgeR$logFC <= -2 & resOrdered_edgeR$PValue < 0.05), 'sig'] <- 'down'
resOrdered_edgeR[which(abs(resOrdered_edgeR$logFC) <= 2 | resOrdered_edgeR$PValue >= 0.05), 'sig'] <- 'none'
write.table(resOrdered_edgeR, 'edgeR_all.txt', col.names = T, row.names = T, sep = '\t', quote = FALSE)
3.3
差異分析
library(Lima)
# 構(gòu)建design 樣本分組矩陣
design <- model.matrix(~0 + coldata$condition)
rownames(design) = colnames(All_sample_2)
colnames(design) <- levels(coldata$condition)
# 構(gòu)建DGElist對象
DGElist <- DGEList(counts = All_sample_2, group = coldata$condition)
# 去除表達量過低的基因
keep <- rowSums(cpm(DGElist) > 1) >= 2
table(keep)
DGElist <- DGElist[keep, ,keep.lib.sizes = FALSE]
# 計算列的矯正因子
DGElist <- calcNormFactors(DGElist )
# 將count值轉(zhuǎn)化成log2-counts per million (logCPM),準(zhǔn)備進行線性回歸
v <- voom(DGElist, design, plot = TRUE, normalize = "quantile")
# 對每一個基因進行線性模型構(gòu)建
fit <- lmFit(v, design)
# 構(gòu)建比較矩陣
cont.matrix <- makeContrasts(contrasts = c('tumor - normal'), levels = design)
# 構(gòu)建芯片數(shù)據(jù)的線性模型,計算估計的相關(guān)系數(shù)和標(biāo)準(zhǔn)差
fit2 <- contrasts.fit(fit, cont.matrix)
# 基于貝葉斯計算T值,F(xiàn)值和log-odds
fit2 <- eBayes(fit2)
nrDEG = topTable(fit2, coef = 1, n = Inf, adjust = "BH")
nrDEG = na.omit(nrDEG)
# 首先對表格排個序,按 padj 值升序排序,相同 padj 值下繼續(xù)按 log2FC 降序排序
resOrdered_limma <- nrDEG[order(nrDEG$adj.P.Val, nrDEG$logFC, decreasing = c(FALSE, TRUE)), ] # 默認(rèn)decreasing = F
resOrdered_limma[which(resOrdered_limma$logFC >= 2 & resOrdered_limma$adj.P.Val < 0.05), 'sig'] <- 'up'
resOrdered_limma[which(resOrdered_limma$logFC <= -2 & resOrdered_limma$adj.P.Val < 0.05), 'sig'] <- 'down'
resOrdered_limma[which(abs(resOrdered_limma$logFC) <= 2 | resOrdered_limma$adj.P.Val >= 0.05), 'sig'] <- 'none'
# 輸出選擇的差異基因總表
write.table(resOrdered_limma, 'limma_all.txt', col.names = T, row.names = T, sep = '\t', quote = FALSE)
參考: