一.流程介紹
后續(xù)會開發(fā)lncRNA、smallRNA、circRNA流程
首先是lncRNA流程,主要分析模塊為:
- 測序數(shù)據(jù)質(zhì)控;
- 序列比對分析;
- 轉(zhuǎn)錄本組裝;
- lncRNA分析;
- 表達(dá)量分析;
- 表達(dá)量差異分析;
- lncRNA靶基因分析;
- 富集分析
1.使用基礎(chǔ)ubuntu鏡像:
docker run -itd --name lnc ubuntu:18.04
2.安裝json環(huán)境配置環(huán)境
換源:
docker cp sources.list 1a73843e30e5:/etc/apt
apt-get update
apt-get upgrade
lncRNA流程前幾個模塊與有參流程模塊腳本基本一樣,軟件使用相同,直接復(fù)制原來的軟件到lnc環(huán)境中,并安裝即可。

3.測試腳本
01-qc.sh、01-qcStat.sh、02-alignHST.sh、02-alignSummary.sh
使用FastQC質(zhì)控與Hisat2比對;
使用stringtie進(jìn)行轉(zhuǎn)錄本拼接與組裝,分為得到mRNA序列與注釋文件與lncRNA序列與注釋文件
03-assemblyStie.sh腳本:
for sample in $samplenames
do
delay stringtie
cd $sample
mkdir AssemblyStielncRNA
cd AssemblyStielncRNA
algn_bam=../AlignmentHST/$sample.bam
echo "Transcript assembly for $sample at $(date)"
$stringtie -p $stringtie_threads -G $lnc_gtf \
-o transcripts.gtf -A $sample.exp -l $sample $algn_bam &
cd ../
mkdir AssemblyStieRNA
cd AssemblyStieRNA
algn_bam=../AlignmentHST/$sample.bam
echo "Transcript assembly for $sample at $(date)"
$stringtie -p $stringtie_threads -G $genome_gtf \
-o transcripts.gtf -A $sample.exp -l $sample $algn_bam &
cd ../../
done
根據(jù)lncRNA注釋文件與mRNA注釋文件,拼接得到每個樣本的轉(zhuǎn)錄本注釋文件
mkdir MergedGTFlncRNA_Stie
find */AssemblyStielncRNA/ -name "transcripts.gtf" -type f -printf "%p \n" >transcripts_file-lncRNA.list
$stringtie --merge -p $stringtie_threads -G $lnc_gtf -o MergedGTFlncRNA_Stie/mergedlncRNA.gtf transcripts_file-lncRNA.list
# #$gffcompare -r $genome_gtf -G -o MergedGTFlncRNA_Stie/merged_compare MergedGTFlncRNA_Stie/mergedlncRNA.gtf
mkdir MergedGTFRNA_Stie
find */AssemblyStieRNA/ -name "transcripts.gtf" -type f -printf "%p \n" >transcripts_file-mRNA.list
$stringtie --merge -p $stringtie_threads -G $genome_gtf -o MergedGTFRNA_Stie/mergedmRNA.gtf transcripts_file-mRNA.list
$cufflinks/gffread -w MergedGTFlncRNA_Stie/transcripts-lncRNA.fa -g $genome_fasta MergedGTFlncRNA_Stie/mergedlncRNA.gtf &
$cufflinks/gffread -w MergedGTFRNA_Stie/transcripts-mRNA.fa -g $genome_fasta MergedGTFRNA_Stie/mergedmRNA.gtf &
分別得到所有樣本lncRNA與mRNA總注釋文件,下一步計算表達(dá)量
for sample in $samplenames
do
delay stringtie
cd $sample
mkdir ExpressionStielncRNA
cd ExpressionStielncRNA
algn_bam=../AlignmentHST/$sample.bam
echo "Start estimating for $sample at $(date)"
$stringtie -e -B -p $stringtie_threads -G ../../MergedGTFlncRNA_Stie/mergedlncRNA.gtf \
-o $sample.gtf -A $sample.exp $algn_bam &
cd ../
mkdir ExpressionStieRNA
cd ExpressionStieRNA
algn_bam=../AlignmentHST/$sample.bam
echo "Start estimating for $sample at $(date)"
$stringtie -e -B -p $stringtie_threads -G ../../MergedGTFRNA_Stie/mergedmRNA.gtf \
-o $sample.gtf -A $sample.exp $algn_bam &
cd ../../
done
得到結(jié)果如下:

4.接下來使用DEseq2或edger進(jìn)行差異分析,可參考有參流程分析:
(1)有生物學(xué)重復(fù)的同時滿足 log2FoldChange 絕對值大于等于 1 且 pvalue 小于 0.05 的基因 significant 標(biāo)為 yes,否則標(biāo)為 no;(2)沒有生物學(xué)重復(fù)的同時滿足 log2FoldChange 絕對值大于等于 1 且 pvalue 小于 1 的基因 significant 標(biāo)為 yes,否則標(biāo)為 no。同時我們會根據(jù)初始運算結(jié)果,對差異閾值做進(jìn)一步定義或者給予建議,例如:1. fold change 閾值更改;2. 統(tǒng)計檢驗閾值(pvalue)更改。滿足顯著性差異閾值的差異基因在分析結(jié)果中通過 significant 參數(shù)一列為 yes 顯現(xiàn)。
5.使用clusterProfiler包 進(jìn)行KEGG與GO富集分析;
- lncRNA靶基因分析;
10.23更新:
LncRNA預(yù)測分析:
使用軟件:
CNCI;
CPC2;
LGC;
CNCI
CNCI是中科院計算所趙屹團(tuán)隊開發(fā)的一款從轉(zhuǎn)錄組中分析編碼RNA和非編碼RNA的軟件。趙屹團(tuán)隊在非編碼RNA領(lǐng)域做了很多出色的工作,建立了目前最權(quán)威的非編碼RNA數(shù)據(jù)庫NONCODE。
github地址:
https://github.com/www-bioinfo-org/CNCI
安裝:
git clone [git@github.com](mailto:git@github.com):www-bioinfo-org/CNCI.git
cd CNCI
unzip libsvm-3.0.zip
cd libsvm-3.0
make
cd ..
或者下載到服務(wù)器,unzip解壓
基本命令為:
python CNCI_package/CNCI.py -f novel.fasta -o CNCI_out -m ve -p 4
參數(shù)說明:
-f 輸入fasta文件(可以使用-g參數(shù)輸入GTF文件,但是同時需要使用-d參數(shù)指定參考基因組的目錄)
-o 輸出結(jié)果目錄
-m 指定模式,脊椎動物選擇ve,植物選擇pl
-p 指定CPU核數(shù)
更多用法參看幫助文檔
結(jié)果文件:

CPC2
CPC2 下載安裝
https://github.com/gao-lab/CPC2_standalone/releases/tag/v1.0.1
安裝:
gzip -dc CPC2-beta.tar.gz | tar xf -
cd CPC2-beta
export CPC_HOME="$PWD"
cd libs/libsvm
gzip -dc libsvm-3.18.tar.gz | tar xf -
cd libsvm-3.18
make clean && make
cd $CPC_HOME
bin/CPC2.py -i (input_seq) -o (result_in_table)

運行命令:
python3 CPC2.py -i transcripts-lncRNA.fa
結(jié)果展示:

LGC
LGC是由北京基因組所基于python2 開發(fā)的一款快速lncRNA預(yù)測工具,該工具通過ORF(開放閱讀框)長度和GC含量間的關(guān)系進(jìn)行相關(guān)運算來鑒定lncRNA。LGC最大特點是能夠基于跨物種策略進(jìn)行l(wèi)ncRNA發(fā)掘。因此LGC可以支持有參數(shù)據(jù)和無參數(shù)據(jù) 進(jìn)行l(wèi)ncRNA鑒定。在區(qū)分從植物到哺乳動物的不同物種的lncRNA和蛋白編碼RNA方面,LGC鑒定的準(zhǔn)確率高達(dá)90%。

下載與安裝:
wget [http://bigd.big.ac.cn/biocode/tools/4/releases/1.0/file/download?filename=lgc-1.0.tar.gz](http://bigd.big.ac.cn/biocode/tools/4/releases/1.0/file/download?filename=lgc-1.0.tar.gz)
tar zxf lgc-1.0.tar.gz
chmod 755 lgc-1.0.py
#確保conda,lgc-1.0.py在環(huán)境變量里
lgc-1.0.py input.fasta output.txt
# Or
python lgc-1.0.py input.fasta output.txt
結(jié)果展示:


二.nextflow使用

這里使用 Nextflow 作為流程搭建工具,它有著很多強大的功能,第一次接觸,首先先安裝:
wget -qO- [https://get.nextflow.io](https://get.nextflow.io/) | bash
或者使用conda安裝
conda install nextflow
官方參考文檔:
https://www.nextflow.io/docs/latest/getstarted.html
這里可以直接使用
直接使用已有l(wèi)ncRNA流程進(jìn)行測試,參考地址
或者使用下面代碼使用rna流程:
nextflow pull nf-core/rnaseq
下載完流程之后:

運行文件為nf結(jié)尾文件,
文件內(nèi)容如下:
#!/usr/bin/env nextflow
import SampleGroup
import Config
/*------------------RNA分析nextflow-pipeline初版-----------------------------
RNA_np_v0.0.1
System: Linux version 3.10.0-693.21.1.el7.x86_64
Tool : Nextflow 0.30.2.4867
-------------------------------------------------------------------------*/
/*---------------------------定義基本目錄-----------------------------------
base pathway
database: 與人hg19相關(guān)的數(shù)據(jù)、工具存儲的目錄
opd: output pathway本項目的輸出文件母目錄
-------------------------------------------------------------------------*/
database="/data/zhangxc/database/homo_sapiens"
opd = "/data/zhangxc/lnc_rna/lncRNA/output"
main_dir = "/data/zhangxc/lnc_rna/lncRNA"
/*-------------------------讀取樣本及定義組---------------------------------
sample and group
data.ini為樣本和組的配置文件,通過congfig.groovy將定義好的樣本和組讀入pipline中備用
sample = {{sample_name},{sample_file1,sample_file2}}
group = {{group_name},{control_name},{case_name},{control_name,case_name}}
nextflow 同一命名變量只允許一次作為輸入變量
sample 擴展成3份 分別用來進(jìn)行fastqc、SOAPnuke、println
-------------------------------------------------------------------------*/
params.data_file="/data/zhangxc/lnc_rna/lncRNA/test/data.ini"
config=new Config(params.data_file)
Channel.from( config.ReadSamples() ).set{samples}
samples.into{samples0;samples1;sample_2}
Channel.from( config.ReadGroups() ).set{groups0}
sample_2.println()
/*------------------------------fastqc------------------------------------
fastqc
質(zhì)量控制模塊,獨立于分析流程的單獨模塊
input : sample
output: otp/fastqc/sample_name
-------------------------------------------------------------------------*/
process fastqc{
tag { sample_name }
// container 'biocontainers/fastqc'
// conda "fastqc"
publishDir { "output/fastqc/"+ sample_name }
input:
set sample_name , files from samples0
output:
file "*_fastqc/Images/*.png"
script:
if(files.size == 1){
"""
echo \$PWD
fastqc --extract -o . ${files[0]}
"""
}else{
"""
echo \$PWD
fastqc --extract -o . ${files[0]} ${files[1]}
"""
}
}
/*------------------------------soapnuke------------------------------------
soapnuke
質(zhì)量控制模塊,開始與分析第一步,用來去除引物生成clean sample
input : sample
output: file otp/soapnuke/sample_name/*.fq.gz val clean_samples-->{{sample_name},{*.fq.gz}}
clean_samples copy to clean_samples1|clean_samples2
-------------------------------------------------------------------------*/
process soapnuke{
tag { sample_name }
// container 'registry.cn-hangzhou.aliyuncs.com/bio/soapnuke'
publishDir path:{"output/soapnuke/"+sample_name} ,mode:'copy',
saveAs: {filename ->
if (filename.indexOf(".txt")>0) filename
else null
}
input:
set sample_name , files from samples1
output:
set sample_name , file("*.fq.gz") into clean_samples
file "*.txt"
script:
"""
echo \$PWD
SOAPnuke filter -1 ${files[0]} -2 ${files[1]} -l 20 -q 0.5 \
-f GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -r GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-o ./ -C ${sample_name}_1.clean.fq.gz -D ${sample_name}_2.clean.fq.gz
"""
}
clean_samples.into{clean_samples1;clean_samples2;}
/*------------------------------hisat------------------------------------
hisat
比對軟件,與bowtie和tophat功能相似,將read與基因組作比對
input : clean_samples1
output: file otp/hisat/sample_name/${sample_name}.sam val sam-->{{sample_name},{${sample_name}.sam}}
file otp/hisat/sample_name/*.align_summary.txt val summary_txt-->{{sample_name},{*.align_summary.txt}}
-------------------------------------------------------------------------*/
process hisat{
tag { sample_name }
// container 'registry.cn-hangzhou.aliyuncs.com/bio/hisat2'
publishDir { "output/hisat/"+ sample_name }
input:
set sample_name , files from clean_samples1
output:
set sample_name , file("${sample_name}.sam") into sam
set sample_name , file("*.align_summary.txt") into summary_txt
script:
"""
echo \$PWD
hisat2 -x $database/grch37_tran/genome_tran -1 ${files[0]} -2 ${files[1]} -S ${sample_name}.sam 2> ${sample_name}.align_summary.txt
"""
}
/*------------------------------samtools----------------------------------
samtools
工具軟件,將sam轉(zhuǎn)為bam格式,并根據(jù)染色體定位進(jìn)行sort和index
input : val sam
output: file otp/samtools/sample_name/${sample_name}.bam val sam-->{{sample_name},{${sample_name}.bam}}
-------------------------------------------------------------------------*/
process samtools{
tag { sample_name }
// container 'broadinstitute/genomes-in-the-cloud:2.3.1-1512499786'
publishDir { "output/samtools/"+ sample_name }
input:
set sample_name , file('sample.sam') from sam
output:
set sample_name , file("${sample_name}.bam") into bam
script:
"""
echo \$PWD
samtools view -bS sample.sam > ${sample_name}_unsorted.bam
samtools sort -@ 8 ${sample_name}_unsorted.bam ${sample_name}
samtools index ${sample_name}.bam ${sample_name}.bam.bai
"""
}
/*------------------------------htseq----------------------------------
htseq
轉(zhuǎn)錄組表達(dá)量分析
input : val bam
output: file otp/htseq/sample_name/${sample_name}.rawCount.txt val htseq_txt-->{{sample_name},{${sample_name}.rawCount.txt}}
htseq_txt copy to htseq_txt0|htseq_txt1|htseq_txt2
-------------------------------------------------------------------------*/
process htseq{
tag { sample_name }
//container 'broadinstitute/genomes-in-the-cloud:2.3.1-1512499786'
//container "dyndna/docker-for-pcawg14-htseq"
// conda "htseq=0.9.1"
publishDir { "output/htseq/"+ sample_name }
input:
set sample_name , file('bam_files') from bam
output:
set sample_name , file("${sample_name}.rawCount.txt") into htseq_txt
script:
"""
echo \$PWD
htseq-count -f bam -m union -s yes -t exon -i gene_id -r pos bam_files \
$database/Homo_sapiens.GRCh37.75.gtf >${sample_name}.rawCount.txt
"""
}
htseq_txt.into{htseq_txt0;htseq_txt1;htseq_txt2}
htseq_txt2.println()
/*------------------------------htseq_xls----------------------------------
htseq_xls
轉(zhuǎn)錄組表達(dá)量分析后,將txt文件轉(zhuǎn)換格式為xls
input : val summary_txt
val htseq_txt0
output: file otp/htseq_xls/sample_name/${sample_name}.rpkm.xls val htseq_xls-->{{sample_name},{${sample_name}.rpkm.xls}}
-------------------------------------------------------------------------*/
process htseq_xls{
tag { sample_name }
//container 'registry.cn-hangzhou.aliyuncs.com/bio/hisat2'
// container "broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 "
publishDir { "output/htseq/"+ sample_name }
input:
set sample_name , file('txt_files') from summary_txt
set sample_name , file('sample.rawCount.txt') from htseq_txt0
output:
set sample_name , file("${sample_name}.rpkm.xls") into htseq_xls
script:
"""
echo \$PWD
perl /data/zhangxc/lnc_rna/lncRNA/bin/RNAseq_bin/htseq2rpkm_hisat.pl txt_files \
$database/hg19.GRCh37.74.ncRNA.gene.len \
sample.rawCount.txt $sample_name ${sample_name}.rpkm.xls
sed 's/^/$sample_name\t/' ${sample_name}.rpkm.xls | sed '1d' > ${sample_name}.rpkm.tmp
"""
}
只截取到基因表達(dá)量部分,感興趣的可以去嘗試運行,運行方式為:
nextflow run zxc.nf
