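Jianshu was taken offline for a while recently because of some unmentionable articles, so I have not been writing study notes. On top of that the new term has just started and there are all sorts of odds and ends to deal with, so I probably will not be back on track until classes begin next week.
One thing I did recently: the pep file produced by the initial genome annotation contains several transcripts per gene, which means many redundant amino acid sequences. We need to remove that redundancy and keep only the longest transcript of each gene.
If every sequence sat on a single line this would be easy to do with grep, but the usual problem applies: FASTA files wrap sequences across multiple lines. It is therefore safer to extract the longest transcript with a script, and I am sharing the scripts I used here.
The raw data looks roughly like this: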
>Gene1.m1 gene=Gene1
MSVAADSPIHSSSSDDFIAYLDDALAASSPDASSDKEVENQDELESGRIKRCKFESAEETEESTSEGIVK
QNLEEYVCTHPGSFGDMCIRCGQKLDGESGVTFGYIHKGLRLHDEEISRLRNTDVKNLLIRKKLYLILDL
...
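Because records like the one above wrap across several lines, a line-oriented tool only ever sees a fragment of each sequence. As a minimal sketch (not one of the scripts used in this post), wrapped lines can be joined into one sequence per record before lengths are compared. The function name `read_fasta` and the second record `Gene1.m2` are made up for illustration, and the sequences are shortened.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Shortened, partly made-up records in the same layout as the sample above.
example = """>Gene1.m1 gene=Gene1
MSVAADSPIH
SSSSDD
>Gene1.m2 gene=Gene1
MSVAADSPIHSSSSDDFIAY
LDDALAASSPD
"""

records = list(read_fasta(example.splitlines()))
for header, seq in records:
    print(header, len(seq))
```

Once each record is a single string, comparing transcript lengths no longer depends on how the file happened to be wrapped.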
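The Perl script I wrote changes both the format of the original file (each sequence is written out on a single line) and the order of the records (sorting a hash is still a mystery to me). When I have time I will think about how to keep the original formatting in the output; for now it will have to do.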
use strict;
use warnings;

# Arguments: input FASTA file, output FASTA file.
my %pep;          # $pep{gene}{transcript} = concatenated protein sequence
my $trans = "";
my $gene  = "";
open my $in,  '<', $ARGV[0] or die "fail to open $ARGV[0]: $!";
open my $out, '>', $ARGV[1] or die "fail to create $ARGV[1]: $!";
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>/) {
        # Header looks like ">Gene1.m1 gene=Gene1"
        my @items = split /\s+/, $line;
        $gene = $items[1];
        $gene =~ s/gene=//g;
        $trans = $items[0];
        $trans =~ s/>//g;
    } else {
        $line =~ s/\.//g;                # remove any "." characters from the sequence line
        $pep{$gene}{$trans} .= $line;    # join wrapped lines into one sequence
    }
}
# Genes and transcripts are visited in sorted name order,
# so the output order differs from the input order.
for my $key1 (sort keys %pep) {
    my $hash2    = $pep{$key1};
    my $maxlen   = 0;
    my $pepfin   = "";
    my $transfin = "";
    for my $key2 (sort keys %$hash2) {
        # print $out $key1."\t".$key2."\t".$hash2->{$key2}."\n";
        my $len = length($hash2->{$key2});
        if ($len > $maxlen) {
            $maxlen   = $len;
            $pepfin   = $hash2->{$key2};
            $transfin = $key2;
        }
    }
    print $out ">".$transfin." gene=".$key1."\n";
    print $out $pepfin."\n";
}
close $in;
close $out;
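A classmate also wrote a Python script with the same function, based on my requirements. I have to admit that Python 3 dictionaries are very good at preserving the original order on output (insertion order is guaranteed from Python 3.7 onward), so this script leaves the original file almost untouched, which is very practical. When I have time I will go through it and annotate it thoroughly; I hope to get better at writing Python scripts [doge, wry smile]!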
import sys
import getopt


def usage():
    print('usage: python3 removeRedundantProteins.py -i <in_fasta> -o <out_fasta> [-h]')


def seq_length(record):
    # Length of the sequence only: skip the header line and ignore newlines.
    return sum(len(s.strip()) for s in record.splitlines()[1:])


def removeRedundant(in_file, out_file):
    # gene ID -> list of FASTA records (header plus sequence lines), one per transcript
    gene_dic = {}
    flag = ''
    with open(in_file) as in_fasta:
        for line in in_fasta:
            if line.startswith('>'):
                # ">Gene1.m1 gene=Gene1" -> gene ID "Gene1" (text before the first ".")
                li = line.strip('>\n').split('.')[0]
                flag = li
                gene_dic.setdefault(li, []).append(line)
            elif flag:
                # Sequence line: append it to the most recent record of the current gene.
                gene_dic[flag][-1] += line
    with open(out_file, 'w') as out_fasta:
        # Dicts keep insertion order (Python 3.7+), so genes are written in input order.
        for k, v in gene_dic.items():
            # max() returns the first longest record, so ties keep the earliest transcript.
            out_fasta.write(max(v, key=seq_length))


def main(argv):
    in_fasta_name = None
    outfile_name = None
    try:
        opts, args = getopt.getopt(argv, 'hi:o:')
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            usage()
            sys.exit()
        elif opt == '-i':
            in_fasta_name = arg
        elif opt == '-o':
            outfile_name = arg
    # Both -i and -o are required.
    if in_fasta_name is None or outfile_name is None:
        usage()
        sys.exit(2)
    removeRedundant(in_fasta_name, outfile_name)


if __name__ == '__main__':
    main(sys.argv[1:])
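
One difference between the two scripts is worth noting: the Perl script reads the gene ID from the `gene=` field of the header, while the Python script takes the text before the first "." in the header. The two agree for the sample data, but they would disagree if a gene ID itself contained a dot. A possible helper that prefers the `gene=` field and only falls back to the dot split might look like the sketch below. The name `gene_id_from_header` and the header `>Chr1.g5.t2 gene=Chr1.g5` are hypothetical and only serve to illustrate the case.

```python
import re


def gene_id_from_header(header):
    """Return the gene ID for a FASTA header line.

    Prefers an explicit "gene=..." field. Otherwise falls back to the text
    before the first "." of the transcript ID, which is an assumption about
    the naming scheme (it matches the sample data in this post).
    """
    header = header.lstrip(">").strip()
    match = re.search(r"(?:^|\s)gene=(\S+)", header)
    if match:
        return match.group(1)
    transcript_id = header.split()[0] if header else ""
    return transcript_id.split(".")[0]


print(gene_id_from_header(">Gene1.m1 gene=Gene1"))      # Gene1
print(gene_id_from_header(">Chr1.g5.t2 gene=Chr1.g5"))  # Chr1.g5 (hypothetical header)
print(gene_id_from_header(">Gene2.m1"))                 # Gene2 (fallback to dot split)
```

Whether the fallback is appropriate depends on how the annotation pipeline names its transcripts, so it is worth checking a few headers by eye first.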