ESPRESSO全稱：Error Statistics PRomoted Evaluator of Splice Site Options，一種處理long read RNA-seq數(shù)據(jù)alignment的方法，可以有效地提高splice junction精度和isoform定量。
下載地址：https://github.com/Xinglab/espresso/releases/

省流版

用基因組+參考注釋確定高置信度的splice junction，根據(jù)這些SJ對基因組中的每個(gè)reads進(jìn)行校正和恢復(fù)，最后用校正過的SJ和注釋信息對所有isoforms定量，輸出更新后的注釋信息gtf文件和isoforms表達(dá)豐度文件。

詳細(xì)版

依賴的軟件：

Perl >= 5.8 built with threading enabled
hmmer >= 3.3.1

conda install -c bioconda hmmer

BLAST >= 2.8.1
軟件包地址：https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
samtools >= 1.6
（最好創(chuàng)建一個(gè)新的環(huán)境來安裝samtools）

conda create -n samtools_env
conda activate samtools_env
conda install -c bioconda samtools openssl=1.0

（以下是可視化依賴的軟件）

UCSC tools
-- bedGraphToBigWig
-- faToTwoBit
-- twoBitInfo
軟件包地址：https://hgdownload.soe.ucsc.edu/admin/exe/
可以用rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ ./下載，默認(rèn)是當(dāng)前目錄，但是有n個(gè)執(zhí)行文件，最好創(chuàng)建一個(gè)新的目錄統(tǒng)一存放，不然當(dāng)前目錄會(huì)爆炸
Python 3
-- NumPy
IGV：https://igv.org/

需要的文件：

sam文件/bam文件
如果不是這種文件類型，可以用minimap2或ONT guppy basecaller輸出這種文件
gtf注釋文件
fasta參考基因組文件

使ESPRESSO在所有目錄下可用

在每個(gè)pl或py腳本文件第一行加#! /usr/bin/perl或#! /usr/bin/python3
更改每一個(gè)腳本文件權(quán)限，增加可執(zhí)行權(quán)限chmod +x ESPRESSO_C.pl
將ESPRESSO執(zhí)行文件加入環(huán)境變量
這種方法可以在本用戶下永久添加環(huán)境變量

vim ~/.bashrc
輸入
export PATH=/data/workdir/wubq/soft/espresso_v_1_3_2/src/:$PATH
export PATH=/data/workdir/wubq/soft/espresso_v_1_3_2/visualization/:$PATH
保存
source ~/.bashrc

如果你沒有做這步的話，后續(xù)用pl或py腳本需要在命令行前加perl或python3，并且執(zhí)行文件要加路徑

預(yù)備工作：準(zhǔn)備samples.tsv文件

/path/to/SIRV2_3.sort.sam   test_sample_name  # tab鍵分隔

在本例中，只有一個(gè)SAM文件。如果有多個(gè)SAM文件，那么每個(gè)文件都單獨(dú)一行。第二列是sample名稱，可用于在最終輸出中將多個(gè)輸入SAM文件分組在一起。

第一步：確定高置信度的splice junctions——ESPRESSO_S

用法：

  ESPRESSO_S.pl -L samples.tsv -F ref.fa -A anno.gtf -O work_dir

-L  剛剛創(chuàng)建的tsv文件，第一列是帶有絕對路徑的sorted sam文件；第二列是sample名稱
-F  reference文件
-A  注釋文件
-B  你認(rèn)為信得過的splice junction，bed文件格式
-O  輸出文件的目錄名稱（不是輸出文件名），默認(rèn)在當(dāng)前目錄
-T  線程數(shù)

用test_sirv數(shù)據(jù)跑的話輸出是：

[Thu Sep  7 10:42:45 2023] Loading reference
Worker 0 begins to scan: 
 /data/workdir/wubq/soft/espresso_v_1_3_2/test_data/test_data_espresso_sirv/SIRV2_3.sort.sam
Worker 0 finished reporting.
[Thu Sep  7 10:42:50 2023] Re-cluster all reads
[Thu Sep  7 10:42:50 2023] Loading annotation
[Thu Sep  7 10:42:50 2023] Summarizing annotated splice junctions for each read group
/data/workdir/wubq/soft/espresso_v_1_3_2/test_data/test_data_espresso_sirv/SIRV2_3.sort.sam(0)
0_1(0)
Worker 0 begins to scan: 
 /data/workdir/wubq/soft/espresso_v_1_3_2/test_data/test_data_espresso_sirv/SIRV2_3.sort.sam
/data/workdir/wubq/soft/espresso_v_1_3_2/test_data/test_data_espresso_sirv/SIRV2_3.sort.sam 0
Worker 0 finished reporting.
[Thu Sep  7 10:42:57 2023] ESPRESSO_S finished its work.

在test_sirv目錄下有以下文件：

ESPRESSO_S結(jié)果

其中sample.tsv.updated文件是為SAM文件分配目標(biāo)id
里面只分配到一個(gè)，ID為0

/path/to/SIRV2_3.sort.sam   test_sample_name    0

第二步：校正和恢復(fù)每個(gè)read的splice junction——ESPRESSO_C

用法：

ESPRESSO_C.pl -I test_sirv -F SIRV2.fasta -X 0 -T 5

-I  (in)由上一步ESPRESSO_S所得的文件所在目錄
-F  fasta reference文件
-X  目標(biāo)ID，如果有多個(gè)輸入，那么ESPRESSO_C將為samples.tsv.updated中的每個(gè)ID分別運(yùn)行
-T  線程數(shù)

輸出：

[Thu Sep  7 13:47:51 2023] Loading splice junction info
[Thu Sep  7 13:47:52 2023] Requesting system to split SAMLIST into 5 pieces
 Divided SAM(LIST) sizes:
 sam.list3aa            1242019
 sam.list3ab            1242019
 sam.list3ac            1242019
 sam.list3ad            1242019
 sam.list3ae            1242019
 SAM(LIST) was divided successfully.
 First group of divided SAM(LIST) files: 
 sam.list3aa: 0
 First reads were recorded successfully for 1 files.
[Thu Sep  7 13:47:52 2023] Loading references
[Thu Sep  7 13:47:52 2023] Scanning SAMLIST by 2 workers
 Worker 1 begins to scan sam.list3aa.
0   16

# 此處省略一堆

Building a new DB, current time: 09/07/2023 13:48:14
New DB name:   /data/workdir/wubq/soft/espresso_v_1_3_2/test_data/test_sirv/0/blast_0/current_db
New DB title:  current_db
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /data/workdir/wubq/soft/espresso_v_1_3_2/test_data/test_sirv/0/blast_0/current_db
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 2 sequences in 0.000781059 seconds.
Worker 1 finished reporting.
[Thu Sep  7 13:48:37 2023] ESPRESSO_C finished its work.

第三步：從reads中鑒定和定量所有isoforms——ESPRESSO_Q

用法：

ESPRESSO_Q.pl -A SIRV_C.gtf -L test_sirv/samples.tsv.updated -V test_sirv/samples_N2_R0_compatible_isoform.tsv

-A  gtf注釋文件
-L  sample list的tsv文件
-V  對于每個(gè)reads的兼容isoform的tsv輸出文件（可能是從屬于某個(gè)isoform的read的可變剪切類型）
-O 輸出文件所在目錄，默認(rèn)在sample list的tsv文件目錄里
-T  線程數(shù)

輸出：

[Thu Sep  7 14:02:47 2023] Loading annotation
[Thu Sep  7 14:02:47 2023] Summarizing annotated isoforms
[Thu Sep  7 14:02:47 2023] Loading corrected splice junctions and alignment information by ESPRESSO
[Thu Sep  7 14:02:48 2023] Categorizing reads according to annotation
[Thu Sep  7 14:02:48 2023] thread_id: 4 starting to process chr: SIRV2
running command: /usr/bin/perl ../src/ESPRESSO_Q_Thread.pm test_sirv/thread_shared_arguments.tmp test_sirv/thread_SIRV2_arguments.tmp
finished command: /usr/bin/perl ../src/ESPRESSO_Q_Thread.pm test_sirv/thread_shared_arguments.tmp test_sirv/thread_SIRV2_arguments.tmp
[Thu Sep  7 14:02:58 2023] thread_id: 4 finished
[Thu Sep  7 14:02:58 2023] cleaning up threads
[Thu Sep  7 14:03:00 2023] ESPRESSO finished quantification

第四步：可視化

用法：

visualize.py --genome-fasta SIRV2.fasta --updated-gtf test_sirv/samples_N2_R0_updated.gtf --abundance-esp test_sirv/samples_N2_R0_abundance.esp --target-gene SIRV2 --minimum-count 1 --descriptive-name SIRV --output-dir test_sirv/visualization

--genome-fasta  輸入?yún)⒖蓟蚪M文件
--updated-gtf  由ESPRESSO生成的，更新后的gtf文件
--abundance-esp   由ESPRESSO生成的豐度文件
--target-gene  要可視化的基因，使用gene_id來匹配ESPRESSO輸出的新異構(gòu)體
--minimum-count  只考慮在一個(gè)樣本的count滿足最小count的isoforms (我不理解這是什么意思TAT)
  (原句：only isoforms where the (normalized) count for a sample meets the minimum count are considered)
--normalize-counts-to-cpm  將原始計(jì)數(shù)轉(zhuǎn)換為CPM
--descriptive-name 圖標(biāo)及文件名
--output-dir  輸出文件所在目錄

可視化輸出文件

這些輸出文件需要用IGV可視化

Genomes -> Load Genome from File -> {file}.fasta and {file}.fasta.fai
File -> Load from File -> {file}.gtf
-- Right click gtf track.
-- Select "squished".
-- Set track height to a large enough value.
File -> Load from File -> {sample_name}.bw for each sample
-- Right click track.
-- Select "Autoscale".
--- Select "Maximum" under "Windowing Function".
-- "Change Track Color".
View -> Add New Panel
-- Click and drag panel borders to show the new panel.
File -> Load from File -> target_genes/{file}.bed
-- Click name of new track and drag to the appropriate panel.
Adjust the coordinates to show the data.
File -> Save Image ->
-- Change file extension to .svg

visualization_sirv.png

解釋一下為什么轉(zhuǎn)換的是CPM，而不是平常轉(zhuǎn)錄組分析中表達(dá)定量常用的RPKM、FPKM、TPM
CPM全稱是Counts per million，是對測序深度進(jìn)行標(biāo)準(zhǔn)化，而TPM是先對基因長度進(jìn)行標(biāo)準(zhǔn)化，再對測序深度進(jìn)行標(biāo)準(zhǔn)化。short read測序建庫時(shí)切割了很多個(gè)比較均勻的短read，基因表達(dá)越多，reads數(shù)越多，所以需要校正。而long read測序，一個(gè)read幾乎就是一條transcript，所以不需要校正基因長度。3‘端polyA如果出錯(cuò)，5’端可能會(huì)測不到或不均勻，也沒有校正的意義。ESPRESSO用的數(shù)據(jù)是三代測序技術(shù)的數(shù)據(jù)。

推薦另一個(gè)可視化軟件，也是邢毅老師團(tuán)隊(duì)開發(fā)的，rMATS-long：https://github.com/Xinglab/rMATS-long?？梢暬Ч?。但是它有一點(diǎn)不太好的地方是只能畫豐度最高的5個(gè)或以下isoform，其他一律歸為”others“?？梢栽趘isualize_isoforms.py中增加顏色，增加越多越好，增加多一種顏色就可以畫多一條isoform。在畫圖時(shí)通過max-transcripts參數(shù)輸入你的isoform個(gè)數(shù)，即可全部畫出。

rMATS-long可視化isoform占比和CPM

rMATS-long可視化isoform結(jié)構(gòu)

visualize_isoforms.py增加顏色

解釋一下各輸出文件

sample_N2_R0_updated.gtf：為檢測到的isoforms更新的GTF注釋文件

每一個(gè)檢測到的isoform會(huì)視為一個(gè)轉(zhuǎn)錄本
source列表示每個(gè)異構(gòu)體是“novel_isoform”還是“annotated_isoform”

sample_N2_R0_abundance.esp：檢測到的isoforms的表達(dá)豐度文件，是tsv文件

每個(gè)檢測到的isoform會(huì)在單獨(dú)的行上寫明
第一列是 transcript_ID, transcript_name, gene_ID
在samples.tsv中提供的每個(gè)樣本名稱都有一個(gè)額外的列。這些列表示從該樣本中被算到這個(gè)isoform的reads數(shù)。
counts是通過期望最大化（EM）來分配的。每個(gè)輸入read最多貢獻(xiàn)1個(gè)count，可以是單個(gè)isoform，也可以是多個(gè)isoform。

sample_N2_R0_compatible_isoform.tsv：每個(gè)read的兼容isoform（可能是從屬于某個(gè)isoform的read？猜的）

該文件僅在使用ESPRESSO_Q的-V參數(shù)時(shí)生成。
各列分別是 read_id, sample_name, read_classification, compatible_isoforms
有以下幾種分類 NIC/NNC, NCD, ISM, FSM, single-exon

####################################
2023.12.1更新

報(bào)錯(cuò)合集：

ESPRESSO_S報(bào)錯(cuò)：

out of memory
原因：1.3.2版本及以上需要perl storable版本在3.0以上，我們的服務(wù)器是2.62?？梢酝ㄟ^perl -e 'use Storable; print("gstorable::VERSIONn")!查看版本
解決：升級perl storable

out of memory

ESPRESS_C報(bào)錯(cuò)：

找不到3.3.1版本的nhmmer，但實(shí)際上安裝了nhmmer，版本號(hào)更高
原因：1.3.2版本設(shè)定只接受3.3.1的nhmmer
解決：自己改一下ESPRESSO_C.pl腳本，把==1改為>=1，如此類推。或者直接用最新版本1.4

nhmmer

源代碼

ESPRESSO_Q報(bào)錯(cuò)：

沒有一個(gè)git庫(not a git repository)
原因：未知
解決：用git init初始化后也沒用，還是不行。

不存在git庫

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

ESPRESSO可變剪切分析使用說明

ESPRESSO可變剪切分析使用說明

省流版

詳細(xì)版

預(yù)備工作：準(zhǔn)備samples.tsv文件

第一步：確定高置信度的splice junctions——ESPRESSO_S

第二步：校正和恢復(fù)每個(gè)read的splice junction——ESPRESSO_C

第三步：從reads中鑒定和定量所有isoforms——ESPRESSO_Q

第四步：可視化

解釋一下各輸出文件

報(bào)錯(cuò)合集：

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

ESPRESSO可變剪切分析使用說明

省流版

詳細(xì)版

預(yù)備工作：準(zhǔn)備samples.tsv文件

第一步：確定高置信度的splice junctions——ESPRESSO_S

第二步：校正和恢復(fù)每個(gè)read的splice junction——ESPRESSO_C

第三步：從reads中鑒定和定量所有isoforms——ESPRESSO_Q

第四步：可視化

解釋一下各輸出文件

報(bào)錯(cuò)合集：

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av