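① Download the TXT file and the platform file

Figure: platform file and TXT file
If the platform file contains neither a gene symbol column nor probe sequences or a GB list (GenBank accessions), the probes cannot be annotated.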

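② ID conversion
1. The platform file has a gene symbol column
Rename the platform file to: ann.txt
Name the Perl script: GEOimmune.probe2symbol.pl
Name the expression file: probeMatrix.txt

Figure: input files
Open the unzipped expression TXT file as a spreadsheet (xls), copy the content shown in the figure below, and paste it into a new text file named probeMatrix.txt.

Figure: probeMatrix.txt
Open the annotation file as a spreadsheet (ann.xls) to find which column holds the gene symbol, then type that column number at the prompt that appears when the Perl script is run.

Figure: running the script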
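The column number can also be looked up programmatically instead of counting columns by eye. The following is a minimal Python sketch, not part of the original workflow. It assumes the annotation header uses a name such as "Gene Symbol"; the actual header names differ between platforms, so the `wanted` tuple and the helper name `find_column` are only illustrative.

```python
import io

def find_column(lines, wanted=("gene symbol", "gene_symbol", "symbol")):
    """Return the 1-based index of the first header column named in `wanted`."""
    for line in lines:
        # Skip metadata lines, which start with '#' or '!', and blank lines
        if line.startswith(("#", "!")) or not line.strip():
            continue
        # The first remaining line is treated as the header
        header = [h.strip().strip('"').lower() for h in line.rstrip("\n").split("\t")]
        for i, name in enumerate(header, start=1):
            if name in wanted:
                return i
        return None  # header found, but no matching column
    return None

# Toy annotation content (assumed layout, for illustration only)
sample = io.StringIO(
    "#ID = probe identifier\n"
    "ID\tGB_ACC\tGene Symbol\tENTREZ_GENE_ID\n"
    "1007_s_at\tU48705\tDDR1\t780\n"
)
col = find_column(sample)
print(col)
```

The returned value is 1-based, so it can be typed directly at the prompt of the Perl script. With a real file, pass `open("ann.txt")` instead of the toy `sample`.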
The Perl code is as follows:
use strict;
use warnings;

# Ask for the 1-based column number of the gene symbol in ann.txt
print STDERR "gene symbol column number: ";
my $geneSymbolCol=<STDIN>;
chomp($geneSymbolCol);
$geneSymbolCol--;    # convert to a 0-based array index

my $expFile="probeMatrix.txt";     # input: probe-level expression matrix
my $gplFile="ann.txt";             # input: platform annotation
my $expFileWF="geneMatrix.txt";    # output: gene-level expression matrix
my %hash=();
my @sampleName=();

open(EXP,"$expFile") or die $!;
# The script reads its own source, so it must be saved under this exact name.
# $p1 and $pl only feed the checks further down, which never trigger with these values.
open(PL,"GEOimmune.probe2symbol.pl") or die $!;
my @pl=<PL>;
my $p1=4;my $pl=119;
close(PL);

# Read the expression matrix: line 1 holds the sample names, the other lines hold values
while(my $exp=<EXP>)
{
    next if ($exp=~/^(\n|\!)/);    # skip blank lines and "!" metadata lines
    chomp($exp);
    if($.==1)
    {
        my @expArr=split(/\t/,$exp);
        for(my $i=0;$i<=$#expArr;$i++)
        {
            my $singleName=$expArr[$i];
            $singleName=~s/\"//g;    # strip double quotes
            if($i==0)
            {
                push(@sampleName,"ID_REF");
            }
            else
            {
                # keep only the part of the sample name before the first "_" or "."
                my @singleArr=split(/\_|\./,$singleName);
                push(@sampleName,$singleArr[0]);
            }
        }
    }
    else
    {
        my @expArr=split(/\t/,$exp);
        for(my $i=0;$i<=$#sampleName;$i++)
        {
            $expArr[$i]=~s/\"//g;    # strip double quotes
            push(@{$hash{$sampleName[$i]}},$expArr[$i]);
        }
    }
}
close(EXP);

# Read the annotation file: probe ID (first column) -> gene symbol
my %probeGeneHash=();
open(GPL,"$gplFile") or die $!;
while(my $gpl=<GPL>)
{
    next if($gpl=~/^(\#|ID|\!|\n)/);    # skip comments, the header and blank lines
    chomp($gpl);
    next if($pl>130);
    my @gplArr=split(/\t/,$gpl);
    # keep only non-empty symbols that contain no whitespace
    if((exists $gplArr[$geneSymbolCol]) && ($gplArr[$geneSymbolCol] ne '') && ($gplArr[$geneSymbolCol] !~ /.+\s+.+/))
    {
        $gplArr[$geneSymbolCol]=~s/(.+?)\/\/\/(.+)/$1/g;    # "A///B" -> "A"
        $gplArr[$geneSymbolCol]=~s/\"//g;                   # strip double quotes
        $probeGeneHash{$gplArr[0]}=$gplArr[$geneSymbolCol];
    }
}
close(GPL);

my @probeName=@{$hash{"ID_REF"}};
delete($hash{"ID_REF"});

# For each sample, average the values of all probes that map to the same gene
my %geneListHash=();
my %sampleGeneExpHash=();
foreach my $key (keys %hash)
{
    next if($p1>13);
    my %geneAveHash=();
    my %geneCountHash=();
    my %geneSumHash=();
    my @valueArr=@{$hash{$key}};
    for(my $i=0;$i<=$#probeName;$i++)
    {
        if(exists $probeGeneHash{$probeName[$i]})
        {
            my $geneName=$probeGeneHash{$probeName[$i]};
            $geneListHash{$geneName}++;
            $geneCountHash{$geneName}++;
            $geneSumHash{$geneName}+=$valueArr[$i];
        }
    }
    foreach my $countKey (keys %geneCountHash)
    {
        $geneAveHash{$countKey}=$geneSumHash{$countKey}/$geneCountHash{$countKey};
    }
    $sampleGeneExpHash{$key}=\%geneAveHash;
}

# Write the gene-level matrix, one gene per row, sorted by gene name
open(WF,">$expFileWF") or die $!;
$sampleName[0]="geneNames";
print WF join("\t",@sampleName) . "\n";
foreach my $probeGeneValue (sort(keys %geneListHash))
{
    next if($probeGeneValue=~/^mir/);    # skip names starting with "mir"
    print WF $probeGeneValue . "\t";
    for(my $i=1;$i<$#sampleName;$i++)
    {
        print WF ${$sampleGeneExpHash{$sampleName[$i]}}{$probeGeneValue} . "\t";
    }
    my $i=$#sampleName;
    print WF ${$sampleGeneExpHash{$sampleName[$i]}}{$probeGeneValue} . "\n";
}
close(WF);

# Never runs with the values set above: it would rewrite the script's own counter line
if($p1>4 || $pl>119)
{
    open(WF,">GEOimmune.probe2symbol.pl") or die $!;
    foreach my $line(@pl)
    {
        $line=~s/my \$p1=\d+;my \$pl=\d+;/my \$p1=4;my \$pl=119;/;
        print WF "$line";
    }
    close(WF);
}
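The core of the script above is the step that averages all probes mapping to the same gene. A minimal Python sketch of just that step, on toy data, may make the logic easier to follow. The function name `collapse_probes` and the probe IDs are hypothetical; this is an illustration of the idea, not a replacement for the Perl script.

```python
from collections import defaultdict

def collapse_probes(expr, probe_to_gene):
    """expr: {probe_id: [value per sample]} -> {gene: [mean value per sample]}."""
    sums = {}
    counts = defaultdict(int)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if not gene:
            continue  # probe has no usable annotation, so it is dropped
        if gene not in sums:
            sums[gene] = [0.0] * len(values)
        for j, v in enumerate(values):
            sums[gene][j] += v
        counts[gene] += 1
    # Divide each per-sample sum by the number of probes seen for that gene
    return {g: [s / counts[g] for s in sums[g]] for g in sums}

# Toy data: two probes for TP53, one for EGFR, one unannotated probe
expr = {"p1": [2.0, 4.0], "p2": [4.0, 8.0], "p3": [1.0, 1.0], "p4": [9.0, 9.0]}
probe_to_gene = {"p1": "TP53", "p2": "TP53", "p3": "EGFR"}
gene_expr = collapse_probes(expr, probe_to_gene)
print(gene_expr)
```

Averaging is only one way to handle multiple probes per gene; it is the one the Perl script uses, so the sketch follows it.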
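2. The platform file has no gene symbol column but has GenBank accessions

Figure: GenBank IDs in the GB List column

Figure: input files 1
The Perl code is the same as above; when running it, enter the column number of the GB List column instead.

Figure: output file 1, with the GB List IDs in the red box
Use output file 1 as the input for the next step.

Figure: input files 2
The Perl code is as follows. Note that one line of the code must be adapted to your data; it is marked with a comment.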
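use strict;
use warnings;

# Build a lookup table: GenBank accession (version suffix removed) -> last column
# of gene2accession.txt, which is expected to hold the gene symbol
my %hash=();
open(RF,"gene2accession.txt") or die $!;
while(my $line=<RF>){
    chomp($line);
    my @arr=split(/\t/,$line);
    my $gbId=$arr[3];    # NOTE: 3 is the GB List column number minus 1; change it to match your data
    $gbId=~s/(.+?)\.\d+/$1/g;    # drop the version suffix, e.g. ".2"
    $hash{$gbId}=$arr[$#arr];
}
close(RF);

# Replace the accession list in the first column of geneMatrix.txt with a gene symbol
open(RF,"geneMatrix.txt") or die $!;
open(WF,">geneMatrix2.txt") or die $!;
while(my $line=<RF>){
    if($.==1){
        print WF $line;    # copy the header line unchanged
        next;
    }
    chomp($line);
    my @arr=split(/\t/,$line);
    my $geneLists=shift(@arr);
    my @zeroArr=split(/\,/,$geneLists);
    # use the first accession in the list that has a known symbol; rows without one are dropped
    MARK:foreach my $gene(@zeroArr){
        if(exists $hash{$gene}){
            print WF $hash{$gene} . "\t" . join("\t",@arr) . "\n";
            last MARK;
        }
    }
}
close(WF);
close(RF);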

Figure: output result
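The lookup performed by the script above can be sketched in a few lines of Python. The sketch assumes, as the Perl code does, that the accession sits at a configurable column index and that the last column holds the gene symbol. The toy rows and the helper names `strip_version`, `build_table` and `first_known_symbol` are illustrative only.

```python
import re

def strip_version(accession):
    """Remove a trailing version suffix: 'NM_000546.6' -> 'NM_000546'."""
    return re.sub(r"\.\d+$", "", accession)

def build_table(rows, acc_col=3):
    """Map version-free accession -> last column of each row (assumed to be the symbol)."""
    return {strip_version(row[acc_col]): row[-1] for row in rows}

def first_known_symbol(gb_list, table):
    """Return the symbol of the first accession in a comma-separated list that is in the table."""
    for acc in gb_list.split(","):
        if acc in table:
            return table[acc]
    return None  # no accession could be mapped; the Perl script drops such rows

# Toy rows standing in for gene2accession.txt (column layout is an assumption)
rows = [
    ["9606", "7157", "-", "NM_000546.6", "TP53"],
    ["9606", "1956", "-", "NM_005228.5", "EGFR"],
]
table = build_table(rows)
print(first_known_symbol("XX_000001,NM_005228", table))
print(first_known_symbol("XX_000001", table))
```

As in the Perl script, only the accessions in the lookup table have their version suffix removed, so the accessions in geneMatrix.txt must already be version-free to match.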
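③ Merging data from multiple GEO chips
When merging several datasets, merge them two at a time.

Figure: input data
GSE33335_type.txt: records the number of sample columns per group, 25 (normal) and 30 (tumor)
GSE33335.txt: gene symbol expression file
Running the Perl script (see figure): passing GSE33335.txt and GSE56807.txt in that order means that the GSE33335.txt columns come first in the merged file. The script also expects the output file name as its third argument.

Figure: running the Perl script
The Perl code is as follows: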
use strict;
use warnings;

my $file1=$ARGV[0];    # input file 1
my $file2=$ARGV[1];    # input file 2
my $out=$ARGV[2];      # output file
my %hash=();           # gene name -> tab-joined values from file 1

open(RF,"$file1") or die $!;    # read file 1
while(my $line=<RF>){
    chomp($line);
    my @arr=split(/\t/,$line);
    my $gene=shift(@arr);
    $hash{$gene}=join("\t",@arr);
}
close(RF);

open(RF,"$file2") or die $!;    # read file 2
open(WF,">$out") or die $!;     # write the merged file
while(my $line=<RF>){
    chomp($line);
    my @arr=split(/\t/,$line);
    my $gene=shift(@arr);
    # keep only genes present in both files; file 1 columns come first
    if(exists $hash{$gene}){
        print WF $gene . "\t" . $hash{$gene} . "\t" . join("\t",@arr) . "\n";
    }
}
close(WF);
close(RF);
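The merge above is an inner join on the gene name: only genes present in both files survive, the columns of file 1 come first, and the row order follows file 2. A minimal Python sketch on toy rows shows this behaviour; the function name `merge_by_gene` and the sample names are hypothetical.

```python
def merge_by_gene(rows1, rows2):
    """Inner-join two tables on their first column; rows1 columns come first."""
    first = {row[0]: row[1:] for row in rows1}
    merged = []
    for row in rows2:
        gene = row[0]
        if gene in first:  # genes missing from rows1 are dropped
            merged.append([gene] + first[gene] + row[1:])
    return merged

# Toy tables; the header row is joined like any other row because both start with "geneNames"
rows1 = [["geneNames", "A1", "A2"], ["TP53", "1", "2"], ["EGFR", "3", "4"]]
rows2 = [["geneNames", "B1"], ["EGFR", "5"], ["MYC", "6"], ["TP53", "7"]]
merged = merge_by_gene(rows1, rows2)
for row in merged:
    print("\t".join(row))
```

Note that the header line is only merged if both files use the same label in the first cell, which is the case for files produced by the ID conversion step.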
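④ Batch correction of merged GEO data
Notes: 1. If negative values appear after batch correction, log-transform the data first and then run the correction again.
2. Merge the datasets two at a time, and run batch correction after every merge.
Code for the log transformation mentioned in note 1. If no negative values appear after batch correction, this step can be skipped.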
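library(limma)
# Read the expression matrix; the first column holds the gene names
rt<-read.table("GSE6863.txt", header=T, sep="\t", check.names=F)
rt=as.matrix(rt)
rownames(rt)=rt[,1]
exp=rt[,2:ncol(rt)]
dimnames=list(rownames(exp),colnames(exp))
rt=matrix(as.numeric(as.matrix(exp)),nrow=nrow(exp),dimnames=dimnames)
rt=avereps(rt)            # average rows that share the same gene name
rt=rt[rowMeans(rt)>0,]    # keep genes whose mean expression is above 0
rt<-log2(rt+1)            # log2(x+1) transformation
write.table(rt,"GSE6863a.txt",sep="\t",quote=F)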
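The transformation itself is simply log2(x + 1) applied to every value; the +1 keeps zeros at zero and avoids taking the log of 0. A small Python sketch on a toy matrix, together with a helper for the negative-value check mentioned in note 1, is given below. The names `log2_plus_one` and `has_negative` are illustrative.

```python
import math

def log2_plus_one(matrix):
    """Apply log2(x + 1) to every value of a list-of-lists matrix."""
    return [[math.log2(v + 1) for v in row] for row in matrix]

def has_negative(matrix):
    """Return True if any value in the matrix is below zero."""
    return any(v < 0 for row in matrix for v in row)

# Toy expression values chosen so that x + 1 is a power of two
raw = [[0.0, 1.0, 3.0], [7.0, 15.0, 1023.0]]
logged = log2_plus_one(raw)
print(logged)
print(has_negative(logged))
```

Running `has_negative` on the batch-corrected matrix is one way to decide whether the log step is needed at all.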
Batch correction code
library(sva)
library(limma)
setwd("C:\\Users\\lexb4\\Desktop\\geoBatch\\06.batchNormalize")
# Read the merged matrix; the first column holds the gene names
rt=read.table("merge.txt",sep="\t",header=T,check.names=F)
rt=as.matrix(rt)
rownames(rt)=rt[,1]
exp=rt[,2:ncol(rt)]
dimnames=list(rownames(exp),colnames(exp))
data=matrix(as.numeric(as.matrix(exp)),nrow=nrow(exp),dimnames=dimnames)
# If a gene appears more than once, average its rows so that each gene is unique
data=avereps(data)
# Define the batches: batch 1 has 50 samples, batch 2 has 10 samples
batchType=c(rep(1,50),rep(2,10))
# Optional: define the sample type within each batch. Here the first 25 samples of batch 1
# are normal and the last 25 are tumor; in batch 2 the first 5 are normal and the last 5 are tumor.
# If the samples are not ordered this way, reorder them in Excel first, then uncomment the two lines below.
mod=NULL    # default: no sample-type covariate is passed to ComBat
#modType=c(rep("normal",25),rep("tumor",25),rep("normal",5),rep("tumor",5))
#mod = model.matrix(~as.factor(modType))
outTab=ComBat(data, batchType, mod, par.prior=TRUE)
outTab=rbind(geneNames=colnames(outTab),outTab)
write.table(outTab,file="normalize.txt",sep="\t",quote=F,col.names=F)
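The batch and sample-type vectors are written by hand, so it is easy for their lengths to disagree with the number of sample columns in merge.txt. The following Python sketch shows one way to sanity-check such labels before running the correction. It mirrors R's `rep()` with a small helper; `check_labels` and the sample count of 60 are assumptions taken from the example numbers above, not part of the original code.

```python
def rep(value, times):
    """Mimic R's rep(value, times) for a single value."""
    return [value] * times

# Same layout as in the R code: 50 samples in batch 1, 10 samples in batch 2
batch_type = rep(1, 50) + rep(2, 10)
mod_type = rep("normal", 25) + rep("tumor", 25) + rep("normal", 5) + rep("tumor", 5)

def check_labels(n_samples, batch_type, mod_type):
    """Return a list of problems; an empty list means the labels look consistent."""
    problems = []
    if len(batch_type) != n_samples:
        problems.append("batch labels: %d, samples: %d" % (len(batch_type), n_samples))
    if len(mod_type) != n_samples:
        problems.append("type labels: %d, samples: %d" % (len(mod_type), n_samples))
    # Each batch should contain more than one sample type, otherwise the batch
    # effect cannot be separated from the biological difference
    groups_per_batch = {}
    for batch, group in zip(batch_type, mod_type):
        groups_per_batch.setdefault(batch, set()).add(group)
    for batch, groups in sorted(groups_per_batch.items()):
        if len(groups) < 2:
            problems.append("batch %s contains only %s" % (batch, sorted(groups)))
    return problems

print(check_labels(60, batch_type, mod_type))
```

In practice `n_samples` would be the number of columns in merge.txt minus the gene name column.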