http://www.itdecent.cn/p/19241ae1b77a
1. Reference documentation
https://interproscan-docs.readthedocs.io/en/latest/
https://github.com/ebi-pf-team/interproscan
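Preface
InterProScan predicts protein function by scanning sequences against databases of protein domains and functional sites. Developed by EBI, it works with InterPro, a non-redundant database of protein families, domains and functional sites that integrates a range of member databases such as Pfam, and it can also report GO annotation. This post describes how to set up InterProScan on a local cluster and how to configure its SGE submission parameters.
Environment requirements
This post uses InterProScan 5.47-82.0 as the example (the download commands further down use 5.55-88.0; substitute whichever version you actually download). First set up Perl, Python 3 and a Java JDK that meet the requirements below, and add each of them to ~/.bashrc.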
64-bit Linux
Perl (default on most Linux distributions)
Python 3 (InterProScan 5.30-69.0 onwards)
Java JDK/JRE version 11 (InterProScan 5.37-76.0 onwards)
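Before installing, the requirements above can be checked with a few lines of shell. This is only a minimal sketch: the function names java_major and check_prereqs are made up for illustration, and the Java check assumes the usual `java -version` output, which contains a quoted version string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the major Java version from `java -version` text.
# Handles both the modern scheme ("11.0.2") and the legacy one ("1.8.0_292").
java_major() {
    local ver
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$ver" in
        1.*) ver=${ver#1.} ;;   # legacy "1.8.0_292" -> "8.0_292"
    esac
    printf '%s\n' "${ver%%[._-]*}"
}

# Hypothetical helper: report every unmet requirement, return non-zero if any.
check_prereqs() {
    local ok=0 jv
    [ "$(uname -s)" = "Linux" ] || { echo "not Linux: $(uname -s)"; ok=1; }
    [ "$(getconf LONG_BIT 2>/dev/null)" = "64" ] || { echo "not a 64-bit system"; ok=1; }
    command -v perl    >/dev/null 2>&1 || { echo "perl not found"; ok=1; }
    command -v python3 >/dev/null 2>&1 || { echo "python3 not found"; ok=1; }
    if command -v java >/dev/null 2>&1; then
        jv=$(java_major "$(java -version 2>&1)")
        [ "${jv:-0}" -ge 11 ] || { echo "Java 11+ needed, found: ${jv:-unknown}"; ok=1; }
    else
        echo "java not found"; ok=1
    fi
    return "$ok"
}
```

Running check_prereqs && echo "all requirements met" prints one line per missing requirement, or the success message when nothing is missing.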
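Download and installation
Starting with 5.47-82.0, PANTHER no longer has to be downloaded separately. The archive is large, so I downloaded it with Xunlei VIP (don't pay for VIP unless you really have to; it's a massive rip-off).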
mkdir my_interproscan
cd my_interproscan
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.55-88.0/interproscan-5.55-88.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.55-88.0/interproscan-5.55-88.0-64-bit.tar.gz.md5
# The checksum verifies that the download is complete. The InterProScan archive is large, so checking it now avoids trouble later from missing files:
md5sum -c interproscan-5.55-88.0-64-bit.tar.gz.md5
# Must return *interproscan-5.55-88.0-64-bit.tar.gz: OK*
# If it reports FAILED, download the file again.
tar -pxvzf interproscan-5.55-88.0-*-bit.tar.gz
# where:
# p = preserve the file permissions
# x = extract files from an archive
# v = verbosely list the files processed
# z = filter the archive through gzip
# f = use archive file
Index hmm models
Before you run interproscan for the first time, you should run the command:
python3 setup.py -f interproscan.properties
This command will press and index the hmm models to prepare them into a format used by hmmscan.
InterProScan is now ready to use.
To turn off the use of the precalculated match lookup service, either use the -dp command line option or edit interproscan.properties and add a # to the start of the following line to comment it out (or delete the line), near the bottom of the file:
Because I only run InterProScan locally and do not need network access, I edited interproscan.properties and commented out the line below.
precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup
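The edit described above can also be scripted. The following is a sketch rather than an official procedure: disable_lookup is a made-up function name, and it assumes the property appears uncommented at the start of a line, as shown above.

```shell
# Hypothetical helper: comment out the lookup-service URL in a properties file.
# sed keeps a backup copy with a .bak suffix; a line that is already
# commented out does not match the pattern and is left untouched.
disable_lookup() {
    local props=$1
    sed -i.bak 's|^precalculated\.match\.lookup\.service\.url=|#&|' "$props"
}
```

For example, disable_lookup interproscan.properties, run from the InterProScan directory.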
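Usage
After entering the interproscan-5.47-82.0 directory, you can run a test with the bundled test data.
./interproscan.sh # Running the script with no arguments prints the available options, including:
-i: input file
-o: output file; cannot be combined with -b (which sets an output file base name instead)
-iprlookup -goterms: used together, these add the GO IDs of the GO annotation to the output (recommended)
-f: output format, one or more of TSV, XML, JSON, GFF3, HTML and SVG. TSV is the easiest to work with for bioinformatics analysis. The default is TSV, XML and GFF3 for protein sequences, and GFF3 and XML for nucleotide sequences
-T: directory for temporary files; by default a temp directory is created under the current directory (it is created in cluster mode)
-appl: the analyses that work out of the box are SFLD, ProDom, Hamap, SMART, CDD, ProSiteProfiles, ProSitePatterns, SUPERFAMILY, PRINTS, PANTHER, Gene3D, PIRSF, Pfam, Coils and MobiDBLite; by default all available analyses are run
For other options, see the official documentation: Running InterProScan 5 (interproscan-docs documentation)
Run command: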
./interproscan.sh -iprlookup -goterms -appl Pfam -appl PRINTS -appl PANTHER -appl ProSiteProfiles -appl SMART -f TSV -i test_proteins.fasta -o test_proteins.fasta.ipscan
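Running this command as-is produces a result file in TSV format. The -appl options can also be written as a single comma-separated list: -appl Pfam,PRINTS,PANTHER,ProSiteProfiles,SMART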
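Since the point of -goterms is to get GO IDs into the TSV, a small post-processing step can pull them out per protein. This sketch assumes the TSV column layout given in the official output-format documentation (column 1 is the protein accession; column 14 holds the GO annotations separated by "|", or "-" when there are none). The function name go_terms_per_protein is made up for illustration.

```shell
# Hypothetical helper: print "protein<TAB>GO term" pairs, one per line, deduplicated.
# Assumes column 1 is the protein accession and column 14 the "|"-separated GO terms.
go_terms_per_protein() {
    awk -F '\t' '
        NF >= 14 && $14 != "" && $14 != "-" {
            n = split($14, terms, "[|]")
            for (i = 1; i <= n; i++) print $1 "\t" terms[i]
        }' "$1" | sort -u
}
```

For example, go_terms_per_protein test_proteins.fasta.ipscan > protein2go.tsv, where protein2go.tsv is an arbitrary output name.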
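SGE submission settings
Running a single sequence or a handful of sequences directly on the local machine is convenient, but for large inputs it is much easier to submit jobs to the scheduler, which means switching to cluster mode. InterProScan supports submission through SGE and LSF; I use SGE, so the SGE settings are described next.
Edit interproscan.properties and add or modify the following parameters: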
#Specify your cluster (LSF, SGE or any other cluster)
#(modified)
grid.name=sge
#grid.name=other-cluster
#Java Virtual Machine (JVM) maximum idle time for jobs.
#Default is 180 seconds, if not specified. When reached the worker will shutdown.
#(added)
jvm.maximum.idle.time.seconds=180
#JVM maximum life time for workers.
#Default is 14400 seconds, if not specified. After this period has passed the worker will shutdown unless it is busy.
#(added)
jvm.maximum.life.seconds=14400
#Maximum number of jobs per clusterRunId. Default is 3000.
#(adjust to your environment)
grid.jobs.limit=1000
#commands to start new jvms
#(both modified)
worker.command=java -Xms256m -Xmx1024m -jar interproscan-5.jar
worker.high.memory.command=java -Xms256m -Xmx2048m -jar interproscan-5.jar
#directory for any log files generated by InterProScan
#(added)
log.dir=logs
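Below are the SGE submission commands. Adjust the -l and -q options to your own environment. I had a little over 48,000 protein sequences, split into 46 jobs, and vf=2.5g,p=1 was enough.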
grid.master.submit.command=qsub -cwd -V -l vf=2.5g,p=1 -q all.q -b y -N i5t1worker
grid.master.submit.high.memory.command=qsub -cwd -V -l vf=2.5g,p=1 -q all.q -b y -N i5t1hmworker
grid.worker.submit.command=qsub -cwd -V -l vf=2.5g,p=1 -q all.q -b y -N i5t2worker
grid.worker.submit.high.memory.command=qsub -cwd -V -l vf=2.5g,p=1 -q all.q -b y -N i5t2hmworker
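The property changes above can be made by hand in an editor. If you would rather script them, one possible approach is a small helper that replaces an existing key or appends a new one. This is a sketch under stated assumptions: set_prop is a made-up name, it only matches uncommented lines that begin with key=, and the value must not contain backslashes, because awk -v would interpret them as escapes.

```shell
# Hypothetical helper: set key $2 to value $3 in properties file $1.
# Replaces an existing uncommented "key=" line, otherwise appends one.
set_prop() {
    local file=$1 key=$2 value=$3 tmp
    tmp=$(mktemp)
    awk -v k="$key" -v v="$value" '
        index($0, k "=") == 1 { print k "=" v; done = 1; next }
        { print }
        END { if (!done) print k "=" v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # write back in place, keeping file permissions
    rm -f "$tmp"
}
```

For example, set_prop interproscan.properties grid.name sge followed by set_prop interproscan.properties grid.jobs.limit 1000.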
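Cluster submission command:
./interproscan.sh -mode cluster -clusterrunid test -iprlookup -goterms -appl Pfam -appl PRINTS -appl PANTHER -appl ProSiteProfiles -appl SMART -f TSV -i test_proteins.fasta -T temp -o test_proteins.fasta.ipscan2
Running this command submits the jobs automatically. Compared with a local run, it adds two required options: -mode cluster, which selects cluster mode, and -clusterrunid test, which sets the name used for the log files in the log directory (replace test with any name you like). -T sets the directory for temporary files.
If you run into problems during installation, see the official guide: Running InterProScan 5 in Cluster Mode (interproscan-docs documentation)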
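The post does not say how the 48,000 or so sequences were divided into 46 jobs. One common approach, assumed here purely for illustration, is to split the input FASTA into fixed-size chunks and run InterProScan once per chunk. The function name split_fasta and the chunk file naming are made up.

```shell
# Hypothetical helper: split FASTA file $1 into chunks of at most $2 sequences,
# written as <prefix>_0001.fasta, <prefix>_0002.fasta, ... where $3 is the prefix.
split_fasta() {
    local fasta=$1 per_chunk=$2 prefix=$3
    awk -v size="$per_chunk" -v prefix="$prefix" '
        /^>/ {
            if (seen % size == 0) {            # this header starts a new chunk file
                if (out != "") close(out)
                out = sprintf("%s_%04d.fasta", prefix, ++chunk)
            }
            seen++
        }
        out != "" { print > out }              # lines before the first header are skipped
    ' "$fasta"
}
```

For example, split_fasta proteins.fasta 1000 chunk writes chunk_0001.fasta, chunk_0002.fasta and so on (proteins.fasta is a placeholder name); each chunk can then be passed to ./interproscan.sh with -i and its own -o.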
5. Run
./interproscan.sh
4. Troubleshooting
1 Missing software
Deactivated analyses:
SignalP_GRAM_POSITIVE (X.X) : Analysis SignalP_GRAM_POSITIVE-X.X is deactivated, because the following parameters are not set in the interproscan.properties file: binary.signalp.4.0.path
SignalP_EUK (X.X) : Analysis SignalP_EUK-X.X is deactivated, because the following parameters are not set in the interproscan.properties file: binary.signalp.4.0.path
Phobius (X.XX) : Analysis Phobius-X.XX is deactivated, because the following parameters are not set in the interproscan.properties file: binary.phobius.pl.path.1.01
TMHMM (X.Xc) : Analysis TMHMM-X.Xc is deactivated, because the following parameters are not set in the interproscan.properties file: binary.tmhmm.path, tmhmm.model.path
SignalP_GRAM_NEGATIVE (X.X) : Analysis SignalP_GRAM_NEGATIVE-X.X is deactivated, because the following parameters are not set in the interproscan.properties file: binary.signalp.4.0.path
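After the run, at the end of the option listing, InterProScan reports that some analyses are unavailable. These programs have to be obtained by registering on their official sites and downloading them yourself, then placed in the corresponding InterProScan directories.
- signalp-4.1: SignalP-4.1 download
- tmhmm-2.0c: tmhmm-2.0c
- phobius1.01: phobius1.01 download
2 Error messages
- signalp-4.1 error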
SignalP Error message: Can't locate FASTA.pm in @INC
- The software path needs to be changed
- Edit the signalp script
# Path in the original file
BEGIN {
$ENV{SIGNALP} = '/usr/opt/www/pub/CBS/services/SignalP-4.1/signalp-4.1';
}
# Change it to your own path
BEGIN {
$ENV{SIGNALP} = '/home/usr/bacteria/app/interproscan2/interproscan-5.55-88.0/bin/signalp/4.1/';
}
- hmm model error
pather.hmm bad file format…………
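- This may be caused by the hmm models being extracted incompletely when the archive was unpacked
- Go back to the initial extraction step: change the permissions of interproscan-5.55-88.0-64-bit.tar.gz, then extract it again
chmod 777 interproscan-5.55-88.0-64-bit.tar.gz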
- phobius1.01 error
Could not read provided fasta sequence at bin/phobius/1.01/phobius.pl line 408
- The phobius binary does not match the 64-bit Linux system; phobius generally ships with a 32-bit binary by default
- Check whether it is 32-bit
file bin/phobius/1.01/decodeanhmm | grep 32-bit
- If it is 32-bit, delete decodeanhmm and rename the decodeanhmm.64bit file to decodeanhmm
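The check and the swap can be combined into one small function. This is a sketch, not part of the phobius or InterProScan distribution: use_64bit_decodeanhmm is a made-up name, the 64-bit file is assumed to be called decodeanhmm.64bit (adjust this if your phobius download names it differently), and the 32-bit binary is kept as a backup rather than deleted.

```shell
# Hypothetical helper: if decodeanhmm in directory $1 is a 32-bit binary,
# back it up and put the 64-bit build in its place.
use_64bit_decodeanhmm() {
    local dir=$1
    # Nothing to do unless the installed binary is reported as 32-bit.
    file "$dir/decodeanhmm" | grep -q '32-bit' || return 0
    # Assumed file name for the 64-bit build; stop if it is not there.
    [ -f "$dir/decodeanhmm.64bit" ] || { echo "no decodeanhmm.64bit in $dir" >&2; return 1; }
    mv "$dir/decodeanhmm" "$dir/decodeanhmm.32bit.bak"
    cp "$dir/decodeanhmm.64bit" "$dir/decodeanhmm"
    chmod +x "$dir/decodeanhmm"
}
```

For example, use_64bit_decodeanhmm bin/phobius/1.01, run from the InterProScan directory.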
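Summary
These are just the pitfalls I ran into while installing InterProScan and how I got around them; I hope they help.
If you spot any mistakes, please point them out so we can learn from each other.
Author: 努力的豬豬包
Link: http://www.itdecent.cn/p/1bc986ee1a9e
Source: Jianshu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.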