單細(xì)胞長(zhǎng)讀長(zhǎng)轉(zhuǎn)錄本異構(gòu)體差異表達(dá)分析DIE

軟件——scisorseqr:https://github.com/noush-joglekar/scisorseqr
前期工作準(zhǔn)備:
1.二代數(shù)據(jù)跑完cellranger,得到barcode、feature、matrix文件。輸入到seurat進(jìn)行數(shù)據(jù)質(zhì)控、細(xì)胞分群、細(xì)胞類型注釋
2.如果沒有二代數(shù)據(jù),只有三代數(shù)據(jù),也可以跑sicelore、Isosceles等軟件。得到[isoform x cell] matrix后輸入到seurat。

安裝

軟件依賴:

  • STARlong for PacBio reads
  • Minimap2 for Oxford Nanopore (or PacBio) reads
  • samtools
  • bedtools
  • python version >=3.7
  • r: devtools,Seurat,tidyr,dplyr
devtools::install_github('noush-joglekar/scisorseqr',build_vignettes = FALSE)
# 教程教的是build_vignettes = TRUE,但是一般都是在linux下跑這個(gè)軟件,build_vignettes需要打開瀏覽器。就很麻煩。github里面自帶教程,可以直接下,沒必要再生成
# https://github.com/noush-joglekar/scisorseqr/blob/master/vignettes/scisorseqr-StandardWorkflow.pdf
# 如果你在windows的R下載的話,可以選擇build_vignettes = TRUE
# 但有可能會(huì)報(bào)錯(cuò):pdflatex: command not found
報(bào)錯(cuò)的話安裝以下:
install.packages("tinytex")
tinytex::install_tinytex()

雖然有教程,但是還是有點(diǎn)簡(jiǎn)略。建立起自己的認(rèn)識(shí),還是寫一寫。預(yù)先告知,我跑到第四步就跑不下去了,后面幾步我也不知道會(huì)遇到什么報(bào)錯(cuò)結(jié)果orz

第一步: Barcode deconvolution

library(Seurat)
library(tidyr)
library(dplyr)
library(scisorseqr)


# 讀入前期細(xì)胞注釋好的seurat object
sce.all<-readRDS("sce.all.final.cellname.rds")

# 獲取barcode 和細(xì)胞類型對(duì)應(yīng)的表
bc <- data.frame("Barcode"=rownames(sce.all@meta.data),
                 "Celltype"=sce.all@meta.data$seurat_annotations) %>%
                  separate(Barcode,into=c("BC","suffix"),sep="-") %>%
                  select(-suffix) %>% as.data.frame()
# 保存一下
write.table(bc, file = "bc_celltype_assignments.txt", sep="\t", row.names = FALSE,
            col.names = FALSE, quote = FALSE)
head(bc)
> BC         Celltype
> AAACCTGAGAATGTGT Pro-B_cell_CD34+
> AAACCTGAGATGGGTC Pro-B_cell_CD34+
> AAACCTGAGCAGGTCA Pro-B_cell_CD34+
> AAACCTGAGGCTCATT Pro-B_cell_CD34+
> AAACCTGAGTCAAGCG        Monocytes
> AAACCTGCAAAGGAAG Pro-B_cell_CD34+
   

bc_clust <- c("scisorseqr/bc_celltype_assignments.txt")
fastqFolder <- c("scisorseqr/fastqFile/OC_SRR25278608/")

########### 非常多的坑啊啊啊啊啊 ############
有可能有些數(shù)據(jù)的barcode是普通單細(xì)胞數(shù)據(jù),用的是v3,也有的是單細(xì)胞ATAC-seq,用的是ARC-v1。
1. 有哪些barcode?如下
- https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist
2. 我下的是公共數(shù)據(jù),我不知道它用的是什么barcode。
- cellranger識(shí)別barcode的時(shí)候默認(rèn)是“auto”自動(dòng)識(shí)別。如果成功跑下來沒報(bào)錯(cuò),那么會(huì)輸出一個(gè)sample/out/web_summary.html的文件,里面就有說這個(gè)文件用的是什么barcode集
3. fqFolder:fastq.gz所在的“文件夾”路徑(最后面要加上“/”,不然也會(huì)報(bào)錯(cuò)!),而不是文件夾里的fastq.gz文件。
- 文件夾里只能有一個(gè)fastq,且和文件夾同名。e.g.文件夾:ABC;文件:ABC.fastq.gz
- 否則就會(huì)報(bào)錯(cuò)“cat: 'getBarcodeOutput/OutputRaw//DeconvBC*.csv': No such file or directory”
4. bc_clust:上一步得到的bc_celltype_assignments.txt所在路徑,而不是讀入文件的data.frame
- 否則就會(huì)報(bào)錯(cuò)
“cat: './OutputRaw//tmp*': No such file or directory
rm: cannot remove './OutputRaw//tmp*': No such file or directory
awk: cmd. line:7: fatal: division by zero attempted
Warning message:
In system(run_bc_deConv) : error in running command”
5. 如果你有多個(gè)文件(replicate)只能一個(gè)一個(gè)處理,所以可以寫個(gè)for循環(huán)....
################# 非常多的坑啊啊啊啊啊 ############


## run command
GetBarcodes(fqFolder = fastqFolder, BCClustAssignFile = bc_clust,
            chemistry = "v3", concatenate = FALSE, filterReads = FALSE)
GetBarcodes(fqFolder = fastqFolder, BCClustAssignFile = bc_clust,
            chemistry = "ARC-v1", concatenate = FALSE, filterReads = FALSE)

第二步: Align your fastq files to the reference

這個(gè)我自己跑minimap2了,因?yàn)檫€有其他東西要跑,所以之前就跑好了這個(gè)文件。
軟件里跑的minimap2參數(shù)是-ax splice --secondary=no ,供參考
軟件跑minimap2的結(jié)果放在MMoutput的文件夾下,里面有bam文件和REPORT.minimap2
并且軟件這步還會(huì)創(chuàng)一個(gè)Misc的文件夾,放了fastqGuide和bamGuide兩個(gè)文件。

  • bamGuide:bam文件絕對(duì)路徑
  • fastqGuide:兩列,第一列為fastq名,\t,第二列為fastq.gz絕對(duì)路徑。
    如果用自己之前就跑好的bam文件的話需要自行創(chuàng)好以上文件
MMalign(fqFolder = fastqFolder, mmProgPath = 'miniconda3/envs/scisorseqr/bin/minimap2',refGenome='genomes/hg38.fa',numThreads = 16)

第三步: Map and filter for full-length, spliced reads

這里需要一個(gè)按染色體分解參考基因組的路徑seqDir,可以用faidx解決

pip install pyfaidx
mkdir chr_fa
cd chr_fa
faidx -x PATH/TO/hg38.fa
gzip ./*

MapAndFilter('LRoutput',annoGZ='database/hg38.gtf.gz',numThreads=24,seqDir="chr_fa",genomeVersion="hg38")

報(bào)錯(cuò):
gzip: LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.introns.gff.gz: No such file or directory
gzip: LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.gff.gz: No such file or directory
gzip: LRoutput/newIsoforms_vs_Anno_ignoreAnno/stretches: No such file or directory 
mv: cannot stat 'tmpDir/mapping.bestperRead.bam': No such file or directory
原因:
所在文件夾和各文件夾位置錯(cuò)了。腳本是按相對(duì)路徑尋找文件,所以各文件的結(jié)構(gòu)非常重要。需要調(diào)整文件夾成如下結(jié)構(gòu),bamGuide內(nèi)容也需要改動(dòng):
并且這一步需要在mmalign目錄下跑
|-- test_scisorseqr
|   |-- bc_celltype_assignments
|   |-- test_scisorseqr.fastq.gz
|   |-- output
|   |   |   |-- OutputFiltered
|   |   |   |   |-- FilteredDeconvBC_AllFiles.csv
|   |   |   |-- OutputRaw
|   |   |   |   |-- AllFiles_Summary
|   |   |   |   |-- DeconvBC_TestHipp.csv
|   |   |   |   |-- DeconvBC_TestHipp_summary
|   |   |   |   |-- TooShort.csv
|   |   |   |   |-- DeconvBC_TestPFC.csv
|   |   |   |   |-- DeconvBC_TestPFC_summary
|   |-- mmalign
|   |   |-- MMoutput
|   |   |   |-- REPORT.minimap
|   |   |   |-- TestHipp.bam
|   |   |   |-- TestPFC.bam
|   |   |-- Misc
|   |   |   |-- bamGuide
|   |   |   |-- fastqGuide

以下是我的輸出,供參考(這步跑好之后最好把LRoutput備份一份,免得后面又報(bào)錯(cuò),會(huì)把里面的文件給刪掉........又得重新跑)

bedtools found in path
samtools found in path
#################################
# RUNNING [miniconda3/envs/scisorseqr/lib/R/library/scisorseqr/bash/mapAndAlignReads.sh]
# Current date: Fri Nov 15 11:12:58 AM CST 2024
# Current dir: mmalign
+++++++++++++++++++++++++ 1. arguments
fastqGuide=Misc/fastqGuide
outdir=LRoutput
tmpDir=tmpDir
numThreads=24
seqDirectory=database/chr_fa
annoGZ=database/hg38.gtf.gz
mappingBAMGuide=Misc/bamGuide
scriptDir=miniconda3/envs/scisorseqr/lib/R/library/scisorseqr/bash
genome + version=hg38
++++++++++++++++++ 1b. deduced from arguments
++++++++++++++++++ 1c. input check
[bam_merge] File 'tmpDir/mapping.bam' exists. Please apply '-f' to overwrite. Abort.
+++++++++++++++++++++ 2b. sorting annotation and projection
+++++++++++++++++++++ 2b1 sorting annotation
 
real    1m6.919s
user    0m32.154s
sys 0m4.844s
++++++++++++++++++ 2c. how many were well mapped
+++++++++++++++ 2c1 finding best hits
wellMapped: mmalign/MMoutput/Hipp_SN_f1_ONT_Run1.bam    20993299    867037  274931  2532303 0.886197
+++++++++++++++ 2c2 sam to bam conversion
+++++++++++++++++++++ 2d. removing ribosomal RNA hits
++++++++++++++++++ 2d1. finding ribosomal RNA hits
++++++++++++++++++ 2d2. removing ribosomal RNA hits
++++++++++++++++++++++++ 3. further analysis
+++++++++++++++++++++ 3a. checking consensus
++++++++++++++++++ 3a1. getting introns and exons in gff format
               
real    14m32.392s
user    11m23.011s
sys 0m35.452s
                  
real    17m56.608s
user    10m38.717s
sys 0m34.499s
++++++++++++++++++ 3a2. getting dinucleotides at intron borders
++++++++++++++++++ 3a2.a. preparing commands for parallel execution
 database/chr_fa/chr1.fa.gz
database/chr_fa/chr10.fa.gz
......
database/chr_fa/chrY.fa.gz
+++++++++++++++ 3a2b execution 
## 0. params
# 0a. read params
# 0b. print params
commandFile=LRoutput/parallel.comp.anno.guide.3.siteSeq
n=24
## 1. reading organizational data
## 2. execution
   
real    3m8.427s
user    45m43.905s
sys 7m17.967s
+++++++++++++++ 3a2c collecting results and removing temporary files
cat tmpDir/site_sequence.chr1 tmpDir/site_sequence.chr10 tmpDir/site_sequence.chr10_GL383545v1_alt tmpDir/site_sequence.chr10_KI270824v1_alt tmpDir/site_sequence.chr10_KQ090021v1_fix ......
tmpDir/site_sequence.chr10_ML143354v1_fix tmpDir/site_sequence.chr10_ML143355v1_fix 
tmpDir/site_sequence.chrY
       
real    6m40.000s
user    6m20.356s
sys 0m4.443s
+++++++++++++++++++++ 3b. strand correction and follow up analysis
++++++++++++++++++ 3b1. getting mapping.bestperRead.noRiboRNA.bam in gff format 
            
real    12m29.409s
user    11m48.126s
sys 0m27.614s
    
real    4m15.684s
user    3m26.646s
sys 0m3.599s
awk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
++++++++++++++++++ 3b3. getting introns
awk: cmd. line:2: warning: regexp escape sequence `\@' is not a known regexp operator
        
real    7m58.682s
user    6m28.702s
sys 0m23.712s
++++++++++++++++++ 3b4. getting genes
### executing pB.getGenes.awk
## A. parsing file2=LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.introns.gff
## B. parsing annotation: 
## C. going over all annotated transcripts: 
## D. counting the number of splice sites per gene:
## E. checking whether a read has equal numbers of splice sites with multiple genes: 
   
real    85m17.427s
user    84m36.266s
sys 1m16.844s
+++++++++++++++++++++++++ 4. zipping what has not be zipped before
+++++++++++++++++++++++++ 5. new isoforms
++++++++++++++++++++++ 5a. annotation
 
real    1m35.860s
user    1m39.012s
sys 0m1.997s
++++++++++++++++++++++ 5b. RNAseq
                                                    
real    52m13.298s
user    56m2.148s
sys 0m50.638s
      
real    5m57.071s
user    5m20.826s
sys 0m14.671s
++++++++++++++++++++++ 5c. comparison
## 0. preliminaries
## 1. annotation: assuming non-zipped, genomically ordered gff in the format Nicholas provided
## 1a. parsing
## 1b. getting all the stretches
## 2. prediction: assuming non-zipped, genomically ordered gff in the format Nicholas provided
## 2a. parsing
## 2b. are the RNAseq stretches annotated ?
+++++++++++++++++++++++++ 6. done

第四步:Concatenate all the information collected so far

目前還是在mmalign目錄下
# 整理所有結(jié)果
InfoPerLongRead(barcodeOutputFile =bc_clust,
                mapAndFilterOut = 'LRoutput/',
                minTimesIsoObserve = 5, rmTmpFolder = FALSE)
如果這里的rmTmpFolder設(shè)置成TRUE會(huì)報(bào)錯(cuò)...我也不知道為什么

在這步我沒有輸出任何信息。但是也沒有報(bào)錯(cuò)。有可能是因?yàn)槲逸斎氲膄astq.gz是sce.obj的一部分細(xì)胞數(shù)據(jù),而不是全部。這個(gè)我實(shí)在解決不了了,只能終止到這步了。

第五步:Quantify the various isoforms observed in your data

IsoQuant(AllInfoFile = 'LongReadInfo/AllInfo',Iso = TRUE)

第六步: Differential isoform analysis

contig文件內(nèi)容是想比較的細(xì)胞類型,如下:

Comparison1Celltypes Comparison2Celltypes
HippNeuron Hipp_ExciteNeuron,Hipp_InhibNeuron,Hipp_GranuleNB PFCNeuron PFC_ExciteNeuron,PFC_InhibNeuron
HippNonNeuron Hipp_Astro,Hipp_ChorPlex,Hipp_Ependymal,Hipp_Myeloid,Hipp_NIPCs,Hipp_Oligo,Hipp_RGL,Hipp_Vasc PFCNonNeuron PFC_Astro,PFC_Oligo,PFC_Myeloid,PFC_Vasc
HippExcite Hipp_ExciteNeuron PFCExcite PFC_ExciteNeuron
HippInhib Hipp_InhibNeuron PFCInhib PFC_InhibNeuron
# 差異分析
config <- c("userInput/config") # 仍然是文件路徑
DiffSplicingAnalysis(configFile = config,typeOfTest = 'Iso') 
# Iso表示 Full-length isoform
# 這里typeOfTest還可以選擇TSS、PolyA、Exon(這個(gè)需要跑5b的步驟,具體看說明書)

第七步: Visualizations

該函數(shù)繪制了一個(gè)三角形熱圖,表示一個(gè)目錄中兩兩比較的DIE基因百分比。condensedCellTypes 是細(xì)胞類型list,如下所示:

P7Hipp_Astro
P7Hipp_ChorPlex
P7Hipp_DivOligo
P7Hipp_Ependymal
P7Hipp_ExcitNeuron
P7Hipp_GranuleNB
P7Hipp_InhibNeuron
P7Hipp_Myeloid
P7Hipp_NIPCs
P7Hipp_Oligo
P7Hipp_RGL
P7Hipp_Vasc

以上的cell type要對(duì)應(yīng)于AllInfo文件里的 barcode-celltype list。如果輸入的cell type類型沒有計(jì)算成對(duì)DIE,則它將輸出全零(比如這里的DivOligo)

# 可視化
condensedCellTypes =c("userInput/condensedCellTypes") #文件路徑
triHeatmap(treeDir = 'TreeTraversal_Iso/',
           comparisonList = condensedCellTypes,
           typeOfTest = "Iso",
           outName = condensedCellTypes)
熱圖
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容