軟件——scisorseqr:https://github.com/noush-joglekar/scisorseqr
前期工作準(zhǔn)備:
1.二代數(shù)據(jù)跑完cellranger,得到barcode、feature、matrix文件。輸入到seurat進(jìn)行數(shù)據(jù)質(zhì)控、細(xì)胞分群、細(xì)胞類型注釋
2.如果沒有二代數(shù)據(jù),只有三代數(shù)據(jù),也可以跑sicelore、Isosceles等軟件。得到[isoform x cell] matrix后輸入到seurat。
安裝
軟件依賴:
- STARlong for PacBio reads
- Minimap2 for Oxford Nanopore (or PacBio) reads
- samtools
- bedtools
- python version >=3.7
- r: devtools,Seurat,tidyr,dplyr
devtools::install_github('noush-joglekar/scisorseqr',build_vignettes = FALSE)
# 教程教的是build_vignettes = TRUE,但是一般都是在linux下跑這個(gè)軟件,build_vignettes需要打開瀏覽器。就很麻煩。github里面自帶教程,可以直接下,沒必要再生成
# https://github.com/noush-joglekar/scisorseqr/blob/master/vignettes/scisorseqr-StandardWorkflow.pdf
# 如果你在windows的R下載的話,可以選擇build_vignettes = TRUE
# 但有可能會(huì)報(bào)錯(cuò):pdflatex: command not found
報(bào)錯(cuò)的話安裝以下:
install.packages("tinytex")
tinytex::install_tinytex()
雖然有教程,但是還是有點(diǎn)簡(jiǎn)略。建立起自己的認(rèn)識(shí),還是寫一寫。預(yù)先告知,我跑到第四步就跑不下去了,后面幾步我也不知道會(huì)遇到什么報(bào)錯(cuò)結(jié)果orz
第一步: Barcode deconvolution
library(Seurat)
library(tidyr)
library(dplyr)
library(scisorseqr)
# 讀入前期細(xì)胞注釋好的seurat object
sce.all<-readRDS("sce.all.final.cellname.rds")
# 獲取barcode 和細(xì)胞類型對(duì)應(yīng)的表
bc <- data.frame("Barcode"=rownames(sce.all@meta.data),
"Celltype"=sce.all@meta.data$seurat_annotations) %>%
separate(Barcode,into=c("BC","suffix"),sep="-") %>%
select(-suffix) %>% as.data.frame()
# 保存一下
write.table(bc, file = "bc_celltype_assignments.txt", sep="\t", row.names = FALSE,
col.names = FALSE, quote = FALSE)
head(bc)
> BC Celltype
> AAACCTGAGAATGTGT Pro-B_cell_CD34+
> AAACCTGAGATGGGTC Pro-B_cell_CD34+
> AAACCTGAGCAGGTCA Pro-B_cell_CD34+
> AAACCTGAGGCTCATT Pro-B_cell_CD34+
> AAACCTGAGTCAAGCG Monocytes
> AAACCTGCAAAGGAAG Pro-B_cell_CD34+
bc_clust <- c("scisorseqr/bc_celltype_assignments.txt")
fastqFolder <- c("scisorseqr/fastqFile/OC_SRR25278608/")
########### 非常多的坑啊啊啊啊啊 ############
有可能有些數(shù)據(jù)的barcode是普通單細(xì)胞數(shù)據(jù),用的是v3,也有的是單細(xì)胞ATAC-seq,用的是ARC-v1。
1. 有哪些barcode?如下
- https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist
2. 我下的是公共數(shù)據(jù),我不知道它用的是什么barcode。
- cellranger識(shí)別barcode的時(shí)候默認(rèn)是“auto”自動(dòng)識(shí)別。如果成功跑下來沒報(bào)錯(cuò),那么會(huì)輸出一個(gè)sample/out/web_summary.html的文件,里面就有說這個(gè)文件用的是什么barcode集
3. fqFolder:fastq.gz所在的“文件夾”路徑(最后面要加上“/”,不然也會(huì)報(bào)錯(cuò)!),而不是文件夾里的fastq.gz文件。
- 文件夾里只能有一個(gè)fastq,且和文件夾同名。e.g.文件夾:ABC;文件:ABC.fastq.gz
- 否則就會(huì)報(bào)錯(cuò)“cat: 'getBarcodeOutput/OutputRaw//DeconvBC*.csv': No such file or directory”
4. bc_clust:上一步得到的bc_celltype_assignments.txt所在路徑,而不是讀入文件的data.frame
- 否則就會(huì)報(bào)錯(cuò)
“cat: './OutputRaw//tmp*': No such file or directory
rm: cannot remove './OutputRaw//tmp*': No such file or directory
awk: cmd. line:7: fatal: division by zero attempted
Warning message:
In system(run_bc_deConv) : error in running command”
5. 如果你有多個(gè)文件(replicate)只能一個(gè)一個(gè)處理,所以可以寫個(gè)for循環(huán)....
################# 非常多的坑啊啊啊啊啊 ############
## run command
GetBarcodes(fqFolder = fastqFolder, BCClustAssignFile = bc_clust,
chemistry = "v3", concatenate = FALSE, filterReads = FALSE)
GetBarcodes(fqFolder = fastqFolder, BCClustAssignFile = bc_clust,
chemistry = "ARC-v1", concatenate = FALSE, filterReads = FALSE)
第二步: Align your fastq files to the reference
這個(gè)我自己跑minimap2了,因?yàn)檫€有其他東西要跑,所以之前就跑好了這個(gè)文件。
軟件里跑的minimap2參數(shù)是-ax splice --secondary=no ,供參考
軟件跑minimap2的結(jié)果放在MMoutput的文件夾下,里面有bam文件和REPORT.minimap2
并且軟件這步還會(huì)創(chuàng)一個(gè)Misc的文件夾,放了fastqGuide和bamGuide兩個(gè)文件。
- bamGuide:bam文件絕對(duì)路徑
- fastqGuide:兩列,第一列為fastq名,\t,第二列為fastq.gz絕對(duì)路徑。
如果用自己之前就跑好的bam文件的話需要自行創(chuàng)好以上文件
MMalign(fqFolder = fastqFolder, mmProgPath = 'miniconda3/envs/scisorseqr/bin/minimap2',refGenome='genomes/hg38.fa',numThreads = 16)
第三步: Map and filter for full-length, spliced reads
這里需要一個(gè)按染色體分解參考基因組的路徑seqDir,可以用faidx解決
pip install pyfaidx
mkdir chr_fa
cd chr_fa
faidx -x PATH/TO/hg38.fa
gzip ./*
MapAndFilter('LRoutput',annoGZ='database/hg38.gtf.gz',numThreads=24,seqDir="chr_fa",genomeVersion="hg38")
報(bào)錯(cuò):
gzip: LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.introns.gff.gz: No such file or directory
gzip: LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.gff.gz: No such file or directory
gzip: LRoutput/newIsoforms_vs_Anno_ignoreAnno/stretches: No such file or directory
mv: cannot stat 'tmpDir/mapping.bestperRead.bam': No such file or directory
原因:
所在文件夾和各文件夾位置錯(cuò)了。腳本是按相對(duì)路徑尋找文件,所以各文件的結(jié)構(gòu)非常重要。需要調(diào)整文件夾成如下結(jié)構(gòu),bamGuide內(nèi)容也需要改動(dòng):
并且這一步需要在mmalign目錄下跑
|-- test_scisorseqr
| |-- bc_celltype_assignments
| |-- test_scisorseqr.fastq.gz
| |-- output
| | | |-- OutputFiltered
| | | | |-- FilteredDeconvBC_AllFiles.csv
| | | |-- OutputRaw
| | | | |-- AllFiles_Summary
| | | | |-- DeconvBC_TestHipp.csv
| | | | |-- DeconvBC_TestHipp_summary
| | | | |-- TooShort.csv
| | | | |-- DeconvBC_TestPFC.csv
| | | | |-- DeconvBC_TestPFC_summary
| |-- mmalign
| | |-- MMoutput
| | | |-- REPORT.minimap
| | | |-- TestHipp.bam
| | | |-- TestPFC.bam
| | |-- Misc
| | | |-- bamGuide
| | | |-- fastqGuide
以下是我的輸出,供參考(這步跑好之后最好把LRoutput備份一份,免得后面又報(bào)錯(cuò),會(huì)把里面的文件給刪掉........又得重新跑)
bedtools found in path
samtools found in path
#################################
# RUNNING [miniconda3/envs/scisorseqr/lib/R/library/scisorseqr/bash/mapAndAlignReads.sh]
# Current date: Fri Nov 15 11:12:58 AM CST 2024
# Current dir: mmalign
+++++++++++++++++++++++++ 1. arguments
fastqGuide=Misc/fastqGuide
outdir=LRoutput
tmpDir=tmpDir
numThreads=24
seqDirectory=database/chr_fa
annoGZ=database/hg38.gtf.gz
mappingBAMGuide=Misc/bamGuide
scriptDir=miniconda3/envs/scisorseqr/lib/R/library/scisorseqr/bash
genome + version=hg38
++++++++++++++++++ 1b. deduced from arguments
++++++++++++++++++ 1c. input check
[bam_merge] File 'tmpDir/mapping.bam' exists. Please apply '-f' to overwrite. Abort.
+++++++++++++++++++++ 2b. sorting annotation and projection
+++++++++++++++++++++ 2b1 sorting annotation
real 1m6.919s
user 0m32.154s
sys 0m4.844s
++++++++++++++++++ 2c. how many were well mapped
+++++++++++++++ 2c1 finding best hits
wellMapped: mmalign/MMoutput/Hipp_SN_f1_ONT_Run1.bam 20993299 867037 274931 2532303 0.886197
+++++++++++++++ 2c2 sam to bam conversion
+++++++++++++++++++++ 2d. removing ribosomal RNA hits
++++++++++++++++++ 2d1. finding ribosomal RNA hits
++++++++++++++++++ 2d2. removing ribosomal RNA hits
++++++++++++++++++++++++ 3. further analysis
+++++++++++++++++++++ 3a. checking consensus
++++++++++++++++++ 3a1. getting introns and exons in gff format
real 14m32.392s
user 11m23.011s
sys 0m35.452s
real 17m56.608s
user 10m38.717s
sys 0m34.499s
++++++++++++++++++ 3a2. getting dinucleotides at intron borders
++++++++++++++++++ 3a2.a. preparing commands for parallel execution
database/chr_fa/chr1.fa.gz
database/chr_fa/chr10.fa.gz
......
database/chr_fa/chrY.fa.gz
+++++++++++++++ 3a2b execution
## 0. params
# 0a. read params
# 0b. print params
commandFile=LRoutput/parallel.comp.anno.guide.3.siteSeq
n=24
## 1. reading organizational data
## 2. execution
real 3m8.427s
user 45m43.905s
sys 7m17.967s
+++++++++++++++ 3a2c collecting results and removing temporary files
cat tmpDir/site_sequence.chr1 tmpDir/site_sequence.chr10 tmpDir/site_sequence.chr10_GL383545v1_alt tmpDir/site_sequence.chr10_KI270824v1_alt tmpDir/site_sequence.chr10_KQ090021v1_fix ......
tmpDir/site_sequence.chr10_ML143354v1_fix tmpDir/site_sequence.chr10_ML143355v1_fix
tmpDir/site_sequence.chrY
real 6m40.000s
user 6m20.356s
sys 0m4.443s
+++++++++++++++++++++ 3b. strand correction and follow up analysis
++++++++++++++++++ 3b1. getting mapping.bestperRead.noRiboRNA.bam in gff format
real 12m29.409s
user 11m48.126s
sys 0m27.614s
real 4m15.684s
user 3m26.646s
sys 0m3.599s
awk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
++++++++++++++++++ 3b3. getting introns
awk: cmd. line:2: warning: regexp escape sequence `\@' is not a known regexp operator
real 7m58.682s
user 6m28.702s
sys 0m23.712s
++++++++++++++++++ 3b4. getting genes
### executing pB.getGenes.awk
## A. parsing file2=LRoutput/mapping.bestperRead.RNAdirection.withConsensIntrons.introns.gff
## B. parsing annotation:
## C. going over all annotated transcripts:
## D. counting the number of splice sites per gene:
## E. checking whether a read has equal numbers of splice sites with multiple genes:
real 85m17.427s
user 84m36.266s
sys 1m16.844s
+++++++++++++++++++++++++ 4. zipping what has not be zipped before
+++++++++++++++++++++++++ 5. new isoforms
++++++++++++++++++++++ 5a. annotation
real 1m35.860s
user 1m39.012s
sys 0m1.997s
++++++++++++++++++++++ 5b. RNAseq
real 52m13.298s
user 56m2.148s
sys 0m50.638s
real 5m57.071s
user 5m20.826s
sys 0m14.671s
++++++++++++++++++++++ 5c. comparison
## 0. preliminaries
## 1. annotation: assuming non-zipped, genomically ordered gff in the format Nicholas provided
## 1a. parsing
## 1b. getting all the stretches
## 2. prediction: assuming non-zipped, genomically ordered gff in the format Nicholas provided
## 2a. parsing
## 2b. are the RNAseq stretches annotated ?
+++++++++++++++++++++++++ 6. done
第四步:Concatenate all the information collected so far
目前還是在mmalign目錄下
# 整理所有結(jié)果
InfoPerLongRead(barcodeOutputFile =bc_clust,
mapAndFilterOut = 'LRoutput/',
minTimesIsoObserve = 5, rmTmpFolder = FALSE)
如果這里的rmTmpFolder設(shè)置成TRUE會(huì)報(bào)錯(cuò)...我也不知道為什么
在這步我沒有輸出任何信息。但是也沒有報(bào)錯(cuò)。有可能是因?yàn)槲逸斎氲膄astq.gz是sce.obj的一部分細(xì)胞數(shù)據(jù),而不是全部。這個(gè)我實(shí)在解決不了了,只能終止到這步了。
第五步:Quantify the various isoforms observed in your data
IsoQuant(AllInfoFile = 'LongReadInfo/AllInfo',Iso = TRUE)
第六步: Differential isoform analysis
contig文件內(nèi)容是想比較的細(xì)胞類型,如下:
| Comparison1Celltypes | Comparison2Celltypes | ||
|---|---|---|---|
| HippNeuron | Hipp_ExciteNeuron,Hipp_InhibNeuron,Hipp_GranuleNB | PFCNeuron | PFC_ExciteNeuron,PFC_InhibNeuron |
| HippNonNeuron | Hipp_Astro,Hipp_ChorPlex,Hipp_Ependymal,Hipp_Myeloid,Hipp_NIPCs,Hipp_Oligo,Hipp_RGL,Hipp_Vasc | PFCNonNeuron | PFC_Astro,PFC_Oligo,PFC_Myeloid,PFC_Vasc |
| HippExcite | Hipp_ExciteNeuron | PFCExcite | PFC_ExciteNeuron |
| HippInhib | Hipp_InhibNeuron | PFCInhib | PFC_InhibNeuron |
# 差異分析
config <- c("userInput/config") # 仍然是文件路徑
DiffSplicingAnalysis(configFile = config,typeOfTest = 'Iso')
# Iso表示 Full-length isoform
# 這里typeOfTest還可以選擇TSS、PolyA、Exon(這個(gè)需要跑5b的步驟,具體看說明書)
第七步: Visualizations
該函數(shù)繪制了一個(gè)三角形熱圖,表示一個(gè)目錄中兩兩比較的DIE基因百分比。condensedCellTypes 是細(xì)胞類型list,如下所示:
P7Hipp_Astro
P7Hipp_ChorPlex
P7Hipp_DivOligo
P7Hipp_Ependymal
P7Hipp_ExcitNeuron
P7Hipp_GranuleNB
P7Hipp_InhibNeuron
P7Hipp_Myeloid
P7Hipp_NIPCs
P7Hipp_Oligo
P7Hipp_RGL
P7Hipp_Vasc
以上的cell type要對(duì)應(yīng)于AllInfo文件里的 barcode-celltype list。如果輸入的cell type類型沒有計(jì)算成對(duì)DIE,則它將輸出全零(比如這里的DivOligo)
# 可視化
condensedCellTypes =c("userInput/condensedCellTypes") #文件路徑
triHeatmap(treeDir = 'TreeTraversal_Iso/',
comparisonList = condensedCellTypes,
typeOfTest = "Iso",
outName = condensedCellTypes)
