「蛋白注釋」- 批量InterProScan注釋

Mar 20,2020 更新!

什么是InterProScan數(shù)據(jù)庫(kù)

InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium. We combine protein signatures from these member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool.

可以看出InterPro數(shù)據(jù)庫(kù)主要通過(guò)蛋白序列信息將蛋白分類(lèi),分為一個(gè)個(gè)家族。是一個(gè)很強(qiáng)大的數(shù)據(jù)庫(kù),具體的算法和實(shí)現(xiàn)邏輯不闡述,主要是我也不懂????,這里主要介紹這個(gè)數(shù)據(jù)庫(kù)的使用方法。
主要有三種方式:

  1. 在線網(wǎng)站注釋?zhuān)?a target="_blank">InterPro 」根據(jù)網(wǎng)站提示一步一步就會(huì)得到結(jié)果,這個(gè)也沒(méi)啥好講的,唯一的缺點(diǎn)就是沒(méi)有辦法批量做,每次只能輸入一個(gè)蛋白質(zhì)序列。
  2. 本地InterProScan:參考「Kai大神的blog」這個(gè)blog中寫(xiě)的很清楚,但是太耗存儲(chǔ)了,所以我也放棄了
  3. 利用EMBL-EBI,(最好搭梯子訪問(wèn))這個(gè)官方的黑科技很贊,下面主要說(shuō)一下怎么用這玩意(Macos下,linux下應(yīng)該也一樣,win不太清楚)

重新看了腳本發(fā)現(xiàn):

# General options
parser.add_option('-h', '--help', action='store_true', help='Show this help message and exit.')
parser.add_option('--email', help='E-mail address.')
parser.add_option('--title', help='Job title.')
parser.add_option('--outfile', help='File name for results.')
parser.add_option('--outformat', help='Output format for results.')
parser.add_option('--asyncjob', action='store_true', help='Asynchronous mode.')
parser.add_option('--jobid', help='Job identifier.')
parser.add_option('--polljob', action="store_true", help='Get job result.')
parser.add_option('--pollFreq', type='int', default=3, help='Poll frequency in seconds (default 3s).')
parser.add_option('--status', action="store_true", help='Get job status.')
parser.add_option('--resultTypes', action='store_true', help='Get result types.')
parser.add_option('--params', action='store_true', help='List input parameters.')
parser.add_option('--paramDetail', help='Get details for parameter.')
parser.add_option('--multifasta', action='store_true', help='Treat input as a set of fasta formatted sequences.')
parser.add_option('--useSeqId', action='store_true', help='Use sequence identifiers for output filenames.'
                                                          'Only available in multi-fasta and multi-identifier modes.')
parser.add_option('--maxJobs', type='int', help='Maximum number of concurrent jobs. '
                                                'Only available in multifasta or list file modes.')

parser.add_option('--quiet', action='store_true', help='Decrease output level.')
parser.add_option('--verbose', action='store_true', help='Increase output level.')
parser.add_option('--version', action='store_true', help='Prints out the version of the Client and exit.')
parser.add_option('--debugLevel', type='int', default=debugLevel, help='Debugging level.')
parser.add_option('--baseUrl', default=baseUrl, help='Base URL for service.')

(options, args) = parser.parse_args()

存在--multifasta的參數(shù),也就是說(shuō)沒(méi)必要把fasta成一個(gè)一個(gè)的。果然在后續(xù)的代碼中看到:

def multipleServiceRun(email, title, params, useSeqId, maxJobs, outputLevel):
    seqs = params['sequence']
    seqs = seqs.split(">")[1:]
    i = 0
    j = maxJobs
    done = 0
    jobs = []
    while done < len(seqs):
        c = 0
        for seq in seqs[i:j]:
            c += 1
            params['sequence'] = ">" + seq
            if c <= int(maxJobs):
                jobId = serviceRun(options.email, options.title, params)
                jobs.append(jobId)
                if outputLevel > 0:
                    if useSeqId:
                        print("Submitting job for: %s" % str(seq.split()[0]))
                    else:
                        print("JobId: " + jobId, file=sys.stderr)
        for k, jobId in enumerate(jobs[:]):
            if outputLevel > 0:
                print("JobId: " + jobId, file=sys.stderr)
            else:
                print(jobId)
            if useSeqId:
                options.outfile = str(seqs[i + k].split()[0])
            getResult(jobId)
            done += 1
            jobs.remove(jobId)
        i += maxJobs
        j += maxJobs
        time.sleep(pollFreq)

其實(shí)需要這幾個(gè)email, title, params, useSeqId, maxJobs, multifasta參數(shù)就可以了。

  • email 沒(méi)啥好解釋的,填寫(xiě)自己的email就可以了?!颈貙?xiě)】
  • useSeqId 參數(shù)解釋說(shuō)是用序列標(biāo)識(shí)(fasta文件">"后面的內(nèi)容,通俗說(shuō)就是基因名字)來(lái)作為輸出文件的文件名【選寫(xiě)】【推薦】
  • maxJobs 每次提交jobs數(shù)量,不建議太大,我嘗試的25沒(méi)有問(wèn)題,這個(gè)數(shù)量和你要做interproscan的基因總數(shù)無(wú)關(guān),可以不拆分fa文件,他會(huì)每次25個(gè)的給你提交。【推薦寫(xiě)上】
  • multifasta 這個(gè)是批量做最主要的參數(shù),如果做一個(gè)就是sequence批量就選multifasta后面直接跟 你要做InterProScan的蛋白序列。序列格式自己看腳本文件,開(kāi)頭有說(shuō)的。

??綜上所述我個(gè)人建議??快速批量??InterProScan的代碼就是??:

python ~/gitee/MyScript/iprscan5.py --multifasta itb.fa --maxJobs 25 --useSeqId --email myemail@126.com --outformat tsv &

-outformat tsv 表示只輸出tsv文件,這樣速度最快,如果輸出html等文件可能出現(xiàn)報(bào)錯(cuò)..還有一點(diǎn)tsv文件可以通過(guò)cat命令直接合并,最后統(tǒng)一處理。

---------------------------------?<下面步驟不推薦!>?-------------------------------------

步驟?

這里直接搬運(yùn)過(guò)來(lái)了

The Representational State Transfer (REST) sample clients are provided for a number of programming languages. For details of how to use these clients, download the client and run the program without any arguments.

Language Download Requirements
Perl iprscan5.pl LWP and XML::Simple
Python iprscan5.py xmltramp2

For details see Environment setup for REST Web Services and Examples for Perl REST Web Services Clients pages.

  • Perl用不明白,所以打算用Python的腳本,首先就是先下載上面的iprscan5.py,但是想用這個(gè)腳本需要先安裝依賴(lài):xmltramp2
pip install xmltramp2==3.0.10
  • 其實(shí)這個(gè)腳本也是一個(gè)一個(gè)提交到網(wǎng)站做InterProScan的,但是你可以自己寫(xiě)個(gè)python或者shell script實(shí)現(xiàn)批量化操作,但是需要提前把fasta文件分割成一條序列一個(gè)文件,這里拿來(lái)主義用的別人現(xiàn)成的腳本:參考blog中有下載鏈接
## 首先看下一共有幾條序列
grep '>' 01.lp.sss.candidate.fa | wc -l
# 279 ## 共有279條序列,那么就切成279個(gè)文件
perl fasta-splitter.pl --n-parts 279 01.lp.sss.candidate.fa
# 01.lp.sss.candidate.fa: 279 sequences, 122269 bp => dividing into 279 parts ...................................................................................................................................................................................................................................................................................... OK
# All done, 0 seconds elapsed
# 01.lp.sss.candidate.part-128.fa 01.lp.sss.candidate.part-269.fa
# 01.lp.sss.candidate.part-129.fa 01.lp.sss.candidate.part-270.fa
# 這里有點(diǎn)問(wèn)題一共最后分成了278個(gè),少一個(gè),打算揪出這個(gè)家伙。
for i in `ls *.part*`; do grep '>' $i > $i.checkWrong.xls; done
wc -l *.checkW*
## 2 01.lp.sss.candidate.part-270.fa.checkWrong.xls 
## 270這個(gè)有兩個(gè)
rm *.checkW*
less *270*
# >g35074 
# seq...
# >g47610
# seq...
# 手動(dòng)分開(kāi)吧,沒(méi)辦法
vim *270
# 復(fù)制第二個(gè)序列粘貼到新的01.lp.sss.candidate.part-270.2.fa里面
vim 01.lp.sss.candidate.part-270.2.fa
mv 01.lp.sss.candidate.part-270.fa 01.lp.sss.candidate.part-270.1.fa
## 這樣順序不會(huì)出錯(cuò)

## 強(qiáng)迫癥,想把文件改成序列ID寫(xiě)個(gè)腳本搞定他
## 1. 做個(gè)配置文件
ls *.part* > filenameOld ##先把舊文件的文件名保存到filenameOld中,
grep '>' 01.lp.sss.candidate.fa >seqID ## 再將序列名字保存到seqID中,vim打開(kāi)用 :1,$s/>//,刪掉>
## double check 序列和配置文件順序
for i in `ls *.part*`; do grep '>' $i >> checkID; done
paste checkID seqID > doublecheckSeqOrder
## 可以手動(dòng)檢查也可以用awk檢查
awk '{if ($1 == $2) print "True"; else print "False"}' doublecheckSeqOrder > checkresult
cat checkresult | sort |uniq -c
#  279 True
paste checkID filenameOld > ChangeName.config
# 刪掉 > 還是喜歡在vim下修改: :1,$s/>//g
vim ChangeName.sh
## 腳本內(nèi)容如下:
while read id
do
    sample=$(echo $id |cut -d" " -f 1 ) 
    oldfile=$(echo $id |cut -d" " -f 2 ) 
    echo $sample
    echo $oldfile
    
mv ${oldfile} ${sample}.fa
done <$1
## 加執(zhí)行權(quán)限
chmod +x ChangeName.sh
sh ChangeName.sh ChangeName.config
## done

單個(gè)序列測(cè)試

# 先跑一個(gè)測(cè)試下
python ~/10.InterproScan/iprscan5.py --sequence ../01.SeqSplit/g10038.fa --email myemail@126.com
JobId: iprscan5-R20200128-131432-0055-48082224-p2m
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#RUNNING
#FINISHED
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.out.txt
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.log.txt
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.tsv.txt
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.xml.xml
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.htmltarball.html.tar.gz
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.gff.txt
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.svg.svg
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.sequence.txt
#Creating result file: iprscan5-R20200128-131432-0055-48082224-p2m.json.txt

批量進(jìn)行

## 配置文件
ls ../02.SeqSplit/*.fa > config
## 腳本
while read id
do
    file=$(basename $id ) 
    sample=${file%%.*} 
    echo $id $sample 
    python  ~/10.InterproScan/iprscan5.py --sequence $id --email shawnwang2016@126.com --outfile ${sample}
done <$1
## 執(zhí)行
chmod +x BatchInterProScan.sh
nohup sh BatchInterProScan.sh config &
## done

當(dāng)然也可以限定輸出形式,--outfmt具體vim 一下iprscan5.py都有,然后就是漫長(zhǎng)的等待結(jié)果了。

??完了,碎一覺(jué)明天看結(jié)果??

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容