For the abbreviations of the cancer types in the TCGA database, see this article:
http://www.itdecent.cn/p/3c0f74e85825
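After downloading RNA, miRNA, and other data from the TCGA database with gdc-tools, we still need the annotation for those samples. It is usually stored in metadata-xiazairiqi.json (xiazairiqi is pinyin for the download date). The annotation can be extracted with the rjson package in R, or with a Perl script plus a short shell pipeline, as follows.
First write a simple Perl script (vim meta.pl):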
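#!/usr/bin/perl
use strict;
use warnings;
# Print only the file_name lines and the TCGA submitter_id lines.
while (<>) {
    # Logical OR: keep the line if either pattern matches.
    # The second pattern expects a trailing comma, optionally followed by whitespace.
    if (/file_name.*gz/ || /submitter_id.*TCGA.*,\s*$/) {
        print;
    }
}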
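A short shell pipeline then yields the annotation we need. The first column is file_name, which matches the names of the downloaded data files. The second column is the TCGA sample barcode, from which the sample's group can be read. As a simple rule, look at the two-digit sample-type code at the start of the fourth barcode field (the 01 in 01A): 01 to 09 indicates a tumor sample and 10 to 19 a normal control.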
cat metadata.cart.2019-04-26.json |perl meta.pl |paste - -|less -SN
1 "file_name": "555de98a-5925-41d0-8095-7ae42c480861.htseq.counts.gz", "entity_submitter_id": "TCGA-A1-A0SP-01A-11R-A084-07",
2 "file_name": "16942d90-640a-4f7f-9822-e613cd44b3a7.htseq.counts.gz", "entity_submitter_id": "TCGA-A8-A07I-01A-11R-A00Z-07",
3 "file_name": "86272569-4b9c-4d44-b8f1-daeb9348a6e0.htseq.counts.gz", "entity_submitter_id": "TCGA-EW-A1IZ-01A-11R-A13Q-07",
4 "file_name": "7f2cf950-b5e1-4a01-a44b-88a4e3303233.htseq.counts.gz", "entity_submitter_id": "TCGA-B6-A0RH-01A-21R-A115-07",
5 "file_name": "7af2075c-0386-4971-ae25-375330ef6cec.htseq.counts.gz", "entity_submitter_id": "TCGA-A7-A0DB-01A-11R-A00Z-07",
6 "file_name": "78f2dfc0-9452-4547-b9a9-eb9dc920a4a9.htseq.counts.gz", "entity_submitter_id": "TCGA-UL-AAZ6-01A-11R-A41B-07",
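For downstream use it may help to strip the JSON punctuation and keep only two tab-separated columns. The following is a minimal sketch, assuming every input line looks like the paired output above. The function name pairs_to_tsv and the output file name sample_sheet.tsv are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# pairs_to_tsv: hypothetical helper, not part of gdc-tools.
# Reads paired lines on stdin and prints "<file name><TAB><TCGA barcode>".
pairs_to_tsv() {
    # Split each line on double quotes: field 4 is the file name and
    # field 8 is the barcode. Lines with fewer fields are skipped.
    awk -F'"' 'NF >= 8 { print $4 "\t" $8 }'
}

# Possible usage (assumes meta.pl and the metadata file from the text):
#   cat metadata.cart.2019-04-26.json | perl meta.pl | paste - - | pairs_to_tsv > sample_sheet.tsv
```

Splitting on the quote character avoids depending on the exact whitespace that paste or less inserts between the two JSON fragments.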
Alternatively, do the extraction and the pairing directly in a Perl script.
Perl script:
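#!/usr/bin/perl
use strict;
use warnings;
my @array;
while (<>) {
    chomp;
    # Collect file_name lines and TCGA submitter_id lines in input order.
    if (/file_name.*gz/ || /submitter_id.*TCGA.*,\s*$/) {
        push @array, $_;
    }
}
# Consecutive elements form (file_name, submitter_id) pairs.
my %hash = @array;
# Note: keys returns the pairs in no particular order.
foreach my $k (keys %hash) {
    print "$k $hash{$k}\n";
}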
Run the Perl script:
cat metadata.json | perl meta.pl
"file_name": "555de98a-5925-41d0-8095-7ae42c480861.htseq.counts.gz", "entity_submitter_id": "TCGA-A1-A0SP-01A-11R-A084-07",
"file_name": "16942d90-640a-4f7f-9822-e613cd44b3a7.htseq.counts.gz", "entity_submitter_id": "TCGA-A8-A07I-01A-11R-A00Z-07",
"file_name": "86272569-4b9c-4d44-b8f1-daeb9348a6e0.htseq.counts.gz", "entity_submitter_id": "TCGA-EW-A1IZ-01A-11R-A13Q-07",
For a detailed explanation of the TCGA barcode, see this article:
http://www.biowolf.cn/TCGA/tcga_sample.html
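As a small illustration of the barcode layout, the tumor/normal rule mentioned earlier can be sketched in shell. This assumes the standard dash-separated barcode, with the sample-type code as the first two characters of the fourth field. The function name sample_group is hypothetical and not part of any TCGA tool.

```shell
#!/bin/sh
# sample_group: hypothetical helper, not part of any TCGA tool.
# Prints "tumor", "normal", or "other" for one TCGA barcode.
sample_group() {
    # The fourth dash-separated field (e.g. 01A) starts with the sample-type code.
    code=$(printf '%s\n' "$1" | cut -d- -f4 | cut -c1-2)
    case "$code" in
        0[1-9]) echo tumor ;;    # 01-09: tumor samples
        1[0-9]) echo normal ;;   # 10-19: normal samples
        *)      echo other ;;    # control samples (20-29) or an unexpected barcode
    esac
}

sample_group "TCGA-A1-A0SP-01A-11R-A084-07"   # prints: tumor
```

Barcodes that are too short to carry a sample-type field fall into the "other" branch, so they can be checked by hand rather than silently assigned to a group.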