Identifying TE-fusion chimeric transcripts from RNA-seq data with TEProf2

References


Transposable elements (TEs) are an abundant genetic resource containing a large number of regulatory sequences. TEs harbor latent cryptic regulatory elements that can be reactivated by epigenetic mechanisms under physiological and pathological conditions, thereby influencing the onset and progression of disease. However, the prevalence and impact of TE-driven onco-exaptation events across cancer types remain poorly understood.
This tutorial is largely a translation of the walkthrough for the article above. I wrote up the commands after testing that they run stably. Some passages are left in English, since the original reads better than a translation would.

Analysis steps

Download the source code from the paper: https://github.com/twlab/TEProf2Paper


I. Configure the teprof2 environment

Organize your files first:

mkdir {sra,raw,clean,align,peaks,motif,qc}
cd sra 

Required software:

stringtie >= 1.3.3
samtools >= 1.3.1
cufflinks >= 2.2.1
python 2.7 (cPickle, pytabix 0.1)
R >= 3.4.1 (ggplot2, bsgenome.hsapiens.ucsc.hg38 (or genome of your choosing), Xmisc, reshape2)

The above can also be installed individually. The pipeline has only been tested against the versions listed and may conflict with newer ones.

Set up the environment with conda:

conda env create -f TEProf2.yml

Activate the environment:

source activate
conda activate teprof2

Then install the R package Xmisc manually:

install.packages('Xmisc')

If the above does not work because Xmisc was recently removed from CRAN, it can be installed via devtools. Quit R and run the following:

conda install -c conda-forge r-devtools

Start R:

R

Load devtools and make sure the environment variable is set correctly:

>library(devtools)
>Sys.setenv(TAR = "/bin/tar")
>install_url("https://cran.r-project.org/src/contrib/Archive/Xmisc/Xmisc_0.2.1.tar.gz")

Add the pipeline's bin directory to PATH:

echo 'export PATH=/mnt/d/D/temp/transposable.elements/twlab-TEProf2Paper-3bb4b8d/bin:$PATH'  >> ~/.bashrc
source ~/.bashrc

II. Download reference files

The authors provide a set of default reference files for hg38, hg19, and mm10. For the simplest setup, download the appropriate archive, place it in the rnapipeline folder, and unzip it. Be sure to update arguments.txt with the full paths to these files (explained in the usage section below).
Download Link hg38: External Download Link
Download Link hg19: External Download Link
Download Link mm10: External Download Link

Set up arguments.txt

The locations of the reference files must be specified in a tab-delimited file with no extra lines and no header. An example:

rmsk /bar/nshah/reference/rmskhg38.bed6.gz
rmskannotationfile /bar/nshah/reference/repeatmasker_description_uniq.lst
gencodeplusdic /bar/nshah/reference/genecode_plus_hg38.dic
gencodeminusdic /bar/nshah/reference/genecode_minus_hg38.dic
focusgenes /bar/nshah/reference/oncogenes_augmented.txt
plusintron /bar/nshah/reference/gencode.v25.annotation.sorted.gtf_introns_plus_sorted.gz
minusintron /bar/nshah/reference/gencode.v25.annotation.sorted.gtf_introns_minus_sorted.gz
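Since the pipeline expects exactly two tab-separated columns with no header, a quick sanity check can catch stray spaces or extra lines before they cause confusing errors downstream (a minimal sketch; the example file name is hypothetical, while arguments.txt is the pipeline's default):

```shell
# Build a tiny example arguments file (tab-separated, no header).
printf 'rmsk\t/bar/nshah/reference/rmskhg38.bed6.gz\n' >  arguments_example.txt
printf 'rmskannotationfile\t/bar/nshah/reference/repeatmasker_description_uniq.lst\n' >> arguments_example.txt

# Every line must have exactly two tab-separated fields; report any offenders.
awk -F'\t' 'NF != 2 { print "bad line " NR ": " $0; bad = 1 } END { exit bad }' arguments_example.txt \
  && echo "arguments file looks OK"
```

Run the same awk check against your real arguments.txt; a line that uses spaces instead of a tab will be flagged.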

Required Arguments

rmsk: tabix-formatted BED6 file with RepeatMasker annotations, or another file against which the user wants to check transcript start locations
rmskannotationfile: a tab-delimited file with 3 columns: subfamily, class, family
gencodeplusdic: Dictionary of all gencode elements including introns and exons for the plus (+) strand
gencodeminusdic: Dictionary of all gencode elements including introns and exons for the minus (-) strand
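To make the rmskannotationfile format concrete, here is a sketch of a three-column table with a format check (the subfamily/class/family rows are illustrative examples; the tabix query in the comment assumes tabix is installed and the real reference file is present):

```shell
# Sketch of the rmskannotationfile format: subfamily, class, family (tab-separated).
printf 'AluYb8\tSINE\tAlu\n'  >  rmsk_annotation_example.lst
printf 'L1HS\tLINE\tL1\n'     >> rmsk_annotation_example.lst
printf 'MER41B\tLTR\tERV1\n'  >> rmsk_annotation_example.lst

# Confirm every row has the expected three columns.
awk -F'\t' 'NF != 3 { exit 1 }' rmsk_annotation_example.lst && echo "annotation table OK"

# The rmsk file itself is a bgzipped, tabix-indexed BED6; a region can be
# queried like this (requires tabix and the real reference file):
#   tabix rmskhg38.bed6.gz chr1:1000000-1100000
```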

III. Download input files from public databases

1. Download public data

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/
prefetch -X 300GB -O ./ --option-file SRR_Acc_List.txt
ls /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/sra/SRR*/*.sra | cut -d "/" -f 10 | while read id; do (fasterq-dump --split-files -e 24 /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/sra/$id --include-technical); done

2. Filter the sequencing data with trim_galore

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/
find . -name "*.fastq" | while read file ; do echo "trim_galore -q 25 --phred33 --length 25 -e 0.1 --stringency 4 -o ./clean "$file >> trim_galore.txt ; done ;
parallel -j 10 < trim_galore.txt

3. Align with STAR (STAR must be run on the native Linux filesystem, not a /mnt mount; it uses a lot of memory, so run samples one at a time rather than with Parallel)

cd /0.clear_fq/
for file in $(ls | grep _1_trimmed.fq.gz)
        do
        pre_name=${file%_1_trimmed.fq.gz}
        echo $pre_name
STAR --runThreadN 20 --genomeDir /StarIndex \
 --runMode alignReads \
 --quantMode TranscriptomeSAM GeneCounts \
 --readFilesIn  /0.clear_fq/${pre_name}_1_trimmed.fq.gz \
 --outSAMtype BAM SortedByCoordinate \
 --readFilesCommand gunzip -c \
 --outSAMattrIHstart 0  \
 --outSAMstrandField intronMotif \
 --outFileNamePrefix  ${pre_name}
done

4. Since STAR cannot be run on an /mnt drive, move the aligned files to the target location and rename them

cat sample.id.txt | while read id; do      
 arr=(${id});
 input=${arr[0]};
 sample=${arr[1]};
 str=`echo $sample | sed 's/ //g'`;
 cp ${input}Aligned.sortedByCoord.out.bam /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/motif/${str}.bam; done

5. Assemble transcripts with stringtie (can be run in parallel if memory is sufficient)

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/motif/
find . -name "*.bam" | while read file ;do fileid=$(echo "$file" | awk -F "/" '{print $2}'); echo "stringtie -o /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/peaks/"${fileid}".gtf "${file}" -m 100 -c 1" >> stringtie.txt ; done ;
parallel -j 10 < stringtie.txt

6. Annotate GTF Files

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/peaks/
find . -name "*.gtf" | while read file ; do echo "rmskhg38_annotate_gtf_update_test_tpm.py "$file >> 1_annotationCommands.txt ; done ;
echo "(1/8) Annotate GTF Files";
parallel -j 10 < 1_annotationCommands.txt

7. Index the BAM files ([samtools view] random alignment retrieval only works for indexed BAM or CRAM files)

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/motif/
## rename the files
find . -name "*.Aligned.sortedByCoord.out.bam" | while read file ;do fileid=$(echo "$file" | awk -F "/" '{print $2}'|awk -F "." '{print $1}'); mv ${file} "${fileid}".bam; done
find . -name "*.bam" | while read file ; do echo "samtools index" $file >> samtools_index.txt ; done
parallel -j 10 < samtools_index.txt

Process the annotation files to get a rough estimate of each transcript's expression relative to all other transcripts of the gene.

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/eu.ec/
find . -name "*_annotated_filtered_test_all" | while read file ; do echo "annotationtpmprocess.py "$file >> 2_processAnnotationCommands.txt ; done ;
echo "(2/8) Annotate GTF Files"
parallel -j 10 < 2_processAnnotationCommands.txt
Output File(s):
(1) <gtffile>annotated_filtered_test(all)_c

Three additional columns are made: (1) Maximum coverage of gene of interest (2) Maximum tpm of gene of interest (3) Total of gene of interest. Note: For those transcripts that are from a TE but do not splice into a gene, these stats will be compared across all the transcripts like this. Thus the fraction will be very low.

Aggregate annotations across samples

echo "(3/8) Aggregate Annotations"
aggregateProcessedAnnotation.R -e "Ec"
Options with Defaults:
-e <ext_treatment> (default: ''): The label in the treatment file names that will identify them. If there is not a treatment versus untreated experimental design, the default will call everything as treatment.
-l <exon 1 length max> (default: 2588): The maximum length of exon 1. We are using the 99th percentile of gencode v25 transcripts
-s <exon skipping max> (default: 2): Based on genomic contamination, assembly can create transcripts that have intron retention. These could be real, but oftentimes these look like noise. A maximum level of 2 is recommended.
-n <sample total min> (default: 1): Minimum number of samples that must have a candidate for it to be considered. With low number of samples 1 is recommended. For larger studies this can be increased.
-t <treatment total min> (default: 0): In case the user would like to only consider candidates that are present in a certain number of treatment samples.
-x <treatment exclusive> (default: no): In case the user would like a maximum for the untreated samples. This could be if the user wants to only consider treatment-exclusive samples. In that case, the user should put yes instead of no.
-k <keep none> (default: no): There can be transcripts from TEs that have splicing, but do not end up splicing into a gene. By default these are removed. This can be changed to yes if the user would like to look at these.
-f <filter for TEs> (default: yes): Repeatmasker files include many repeats besides TE. This could be an interesting benchmark to compare to, but by default they are removed. The user can specify no if they would like to keep these. NOTE: Currently does not work with downstream steps if filter is set to No.
-a <argument file> (default: <directory of pipeline>/arguments.txt): The arguments.txt file location. By default, the arguments.txt file in the rnapipeline directory is used. To use another, specify the file's full path.
Output File(s):
(1) filter_combined_candidates.tsv: A file with every TE-gene transcript. This file is used for calculating read information in subsequent steps.
(2) initial_candidate_list.tsv: A summary of filter_combined_candidates.tsv for each unique candidate. Also lists the treatment and untreated files that the candidate is present in.
(3) Step4.RData: Workspace file with data loaded from R session. Subsequent steps load this to save time.
Note:

This step outputs ALL assembled transcripts. Assembly is very noisy, so the output of this step will have a high false-positive rate. Filtering on TPM, or on the fraction of TPM, can help. It is recommended that the user inspect initial_candidate_list.tsv and confirm that the candidates are in the desired format.
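As a sketch of such a filter: assuming initial_candidate_list.tsv has a TPM column (the column number and names below are hypothetical and should be checked against the actual header), candidates above a TPM cutoff could be kept with awk:

```shell
# Mock a candidate list with a header and a TPM column (column 3 here is an
# assumption for illustration; check the real header of initial_candidate_list.tsv).
printf 'candidate\tgene\ttpm\n'     >  initial_candidate_list_example.tsv
printf 'AluYb8-GENE1\tGENE1\t5.2\n' >> initial_candidate_list_example.tsv
printf 'L1HS-GENE2\tGENE2\t0.3\n'   >> initial_candidate_list_example.tsv

# Keep the header plus any candidate with TPM >= 1.
awk -F'\t' 'NR == 1 || $3 >= 1' initial_candidate_list_example.tsv > filtered_candidates_example.tsv
```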

Calculate Read Information

It is recommended that candidates are confirmed using read information. We have found this to greatly increase the specificity of the pipeline since assembly can make many errors especially without a reference.
We have developed a fast parser using samtools and bedtools to extract read information. There are separate programs for single-end and paired-end reads. The only argument needed is the path to the BAM files.
Note:

BAM files should be sorted by position and indexed for this to work. BAM file names should match the GTF files originally input to the pipeline, differing only in extension. Paired-end reads are recommended; single-end performance is not as robust.
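Since each BAM must pair with the GTF of the same basename, a quick check can list any GTF that lacks a matching indexed BAM before the parser is run (a sketch; the directory names mirror this tutorial's layout and the sample names are made up):

```shell
# Mock layout: one directory holds GTFs, the other holds sorted + indexed BAMs.
mkdir -p peaks_example motif_example
touch peaks_example/sampleA.gtf motif_example/sampleA.bam motif_example/sampleA.bam.bai
touch peaks_example/sampleB.gtf   # deliberately missing its BAM

# Report any GTF whose matching BAM or .bai index is absent.
for gtf in peaks_example/*.gtf; do
    base=$(basename "$gtf" .gtf)
    if [ ! -f "motif_example/$base.bam" ] || [ ! -f "motif_example/$base.bam.bai" ]; then
        echo "missing BAM or index for $base"
    fi
done
```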

cd /mnt/d/D/xuhonglab/singleED/NCB/bulk-seq/eu.ec/
mkdir filterreadstats
commandsmax_speed_se.py filter_combined_candidates.tsv ../motif/
# For paired-end data, use: commandsmax_speed.py filter_combined_candidates.tsv ../motif/
echo "(4/8) Filter based on Reads"
parallel -j 10 < filterreadcommands_se.txt
Combine all the read-information files:
find ./filterreadstats -maxdepth 1 -name "*.stats" -type f -print0 | xargs -0 -n128 -P1 grep -H e > resultgrep_filterreadstatsdone.txt
cat resultgrep_filterreadstatsdone.txt | sed 's/\:/\t/g' > filter_read_stats.txt
filterReadCandidates.R 

Output File(s):

(1) read_filtered_candidates.tsv: A file with every TE-gene transcript. This file is used for calculating read information in subsequent steps.
(2) candidate_transcripts.gff3: A GFF3 formatted file of all the unique TE-gene transcripts detected. This can be viewed on a genome viewer to see what candidates look like. This is used in subsequent steps.
(3) Step6.RData: Workspace file with data loaded from R session. Subsequent steps load this to save time.
Remove step 4 session data that is no longer needed.

Merge with the reference GTF

gffread -E candidate_transcripts.gff3 -T -o candidate_transcripts.gtf
echo candidate_transcripts.gtf > cuffmergegtf.list
mkdir merged_asm_full
cuffmerge -o ./merged_asm_full -g /mnt/d/D/linux/genome/gencode-h/gencode.v42.chr_patch_hapl_scaff.annotation.gtf cuffmergegtf.list
mv ./merged_asm_full/merged.gtf reference_merged_candidates.gtf
gffread -E reference_merged_candidates.gtf -o- > reference_merged_candidates.gff3
echo "(5/8) Annotate Merged Transcripts"
rmskhg38_annotate_gtf_update_test_tpm_cuff.py reference_merged_candidates.gff3

Output File(s):
(1) reference_merged_candidates.gff3_annotated_test_all
(2) reference_merged_candidates.gff3_annotated_filtered_test_all (Final list of all TE-gene transcripts)
(3) reference_merged_candidates.gff3_annotated_test*
(4) reference_merged_candidates.gff3_annotated_filtered_test
Note:

We recommend users to view the reference_merged_candidates.gff3 on a genome browser to confirm that the merged transcript file has the expected candidates. The user can map transcripts back to their annotation using the transcript ID and confirm proper identification.

Quantify transcripts

find ../motif/ -name "*bam" | while read file ; do xbase=${file##*/}; echo "samtools view -q 255 -h "$file" | stringtie - -o "${xbase%.*}".gtf -e -b "${xbase%.*}"_stats -p 2 --fr -m 100 -c 1 -G reference_merged_candidates.gtf" >> quantificationCommands.txt ; done ;
echo "(6/8) Transcript Quantification"
parallel -j 10 < quantificationCommands.txt
mergeAnnotationProcess.R

Options with Defaults:
-f <merged gtf annotation file> (default: reference_merged_candidates.gff3_annotated_filtered_test_all): Reference transcript file to be processed. Default will work if all the previous steps have been done as described with the same names.
Output File(s):
(1) candidate_introns.txt: A large one-line string with all intron locations. This is used in subsequent steps to extract intron read information
(2) candidate_names.txt: The transcript names for all remaining candidate transcripts
(3) Step10.RData: Workspace file with data loaded from R session. Subsequent steps load this to save time.
Remove step 6 session data that is no longer needed.

Process the stringtie transcript annotation files to extract and summarize the relevant information

(The stringtie output needs to be formatted into a table, so a few helper scripts and bash commands are used to build these tables.)

ls ./*stats/i_data.ctab > ctab.txt
cat ctab.txt | while read ID ; do fileid=$(echo "$ID" | awk -F "/" '{print $2}'| awk -F "_" '{print $1}'); cp $ID ./${fileid}_i_data.ctab; done;

find . -name "*i_data.ctab" > ctab_i.txt
cat ctab_i.txt | while read ID ; do fileid=$(echo "$ID" | awk -F "/" '{print $2}'); cat <(printf 'chr\tstrand\tstart\tend\t'${fileid/_stats/}'\n') <(grep -f candidate_introns.txt $ID | awk -F'\t' '{ print $2"\t"$3"\t"$4"\t"$5"\t"$6 }') > ${ID}_cand ; done ;

cat <(find . -name "*i_data.ctab_cand" | head -1 | while read file ; do cat $file | awk '{print $1"\t"$2"\t"$3"\t"$4}' ; done;) > table_i_all

find . -name "*i_data.ctab_cand" | while read file ; do paste -d'\t' <(cat table_i_all) <(cat $file | awk '{print $5}') > table_i_all_temp; mv table_i_all_temp table_i_all; done ;

ls ./*stats/t_data.ctab > ctablist.txt

cat ctablist.txt | while read file ; do echo "stringtieExpressionFrac.py $file" >> stringtieExpressionFracCommands.txt ; done;

echo "(7/8) Transcript Quantification比較耗時間2小時"
parallel -j 24 < stringtieExpressionFracCommands.txt

ls ./*stats/t_data.ctab_frac_tot > ctab_frac_tot_files.txt
ls ./*stats/t_data.ctab_tpm > ctab_tpm_files.txt

cat <(echo "TranscriptID") <(find . -name "*ctab_frac_tot" | head -1 | while read file ; do sort $file | awk '{print $1}' ; done;) > table_frac_tot
cat ctab_frac_tot_files.txt | while read file ; do fileid=$(echo "$file" | awk -F "/" '{print $2}') ; paste -d'\t' <(cat table_frac_tot) <(cat <(echo ${fileid/_stats/}) <(sort $file | awk '{print $2}')) > table_frac_tot_temp; mv table_frac_tot_temp table_frac_tot; done ;

cat <(echo "TranscriptID") <(find . -name "*ctab_tpm" | head -1 | while read file ; do sort $file | awk '{print $1}' ; done;) > table_tpm
cat ctab_tpm_files.txt | while read file ; do fileid=$(echo "$file" | awk -F "/" '{print $2}') ; paste -d'\t' <(cat table_tpm) <(cat <(echo ${fileid/_stats/}) <(sort $file | awk '{print $2}')) > table_tpm_temp; mv table_tpm_temp table_tpm; done ;

cat <(head -1 table_frac_tot) <(grep -Ff candidate_names.txt table_frac_tot) > table_frac_tot_cand
cat <(head -1 table_tpm) <(grep -Ff candidate_names.txt table_tpm) > table_tpm_cand

Sample identification, quantification, and building the final tables

Aggregate the stringtie annotation and intron coverage information. Candidate presence will be predicted according to user-set parameters. The tables output at this step can also be used for more advanced statistical analysis.

finalStatisticsOutput.R -e "Eu"

Output File(s):

(1) All TE-derived Alternative Isoforms Statistics.xlsx: A file with the final statistics calculated for each candidate. There is also data on the gene expression in both groups (treatment and normal), fraction of gene expression (treatment and normal), the number of reads to main splicing intron (treatment and normal), and treatment enrichment.
Note:

The Treatment Count and Normal Count columns are calculated based on the number of files in which the candidate passes the thresholds set on fraction of expression, total gene expression, and number of reads spanning the main splicing intron. The final table has all the data used for this, so the user can try using different thresholds to optimize candidate presence based on their data.
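To experiment with different thresholds, the final table can be re-filtered directly. A sketch, assuming hypothetical column positions for treatment count and normal count (check the real header of the output table first):

```shell
# Mock a candidates table: candidate, treatment_count, normal_count
# (column order here is an assumption for illustration).
printf 'candidate\ttreatment_count\tnormal_count\n' >  candidate_stats_example.tsv
printf 'AluYb8-GENE1\t4\t0\n'                       >> candidate_stats_example.tsv
printf 'L1HS-GENE2\t1\t3\n'                         >> candidate_stats_example.tsv

# Keep treatment-enriched candidates: seen in >= 2 treatment samples and 0 normals.
awk -F'\t' 'NR == 1 || ($2 >= 2 && $3 == 0)' candidate_stats_example.tsv > enriched_candidates_example.tsv
```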

(2) allCandidateStatistics.tsv: file with gene expression, fraction expression, transcript expression, and intron junction read information across all the samples for all the candidates.
(3) Step11_FINAL.RData: Workspace file with data loaded from R session. Can be loaded by user to save time to do more advanced analysis.
Remove old RData, the final one will have all data from previous ones.

Translate the transcripts using the Kozak rule and generate a FASTA of the RNA sequences.
