Introduction
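The single-cell field is now intensely competitive. Ten years have passed since the first single-cell sequencing paper was published in 2009. One technology that strikes a good balance between throughput and cost is Drop-seq, a droplet-microfluidics method for single-cell profiling developed in 2015 by the McCarroll lab (Department of Genetics, Harvard Medical School). The single-cell product from 10x Genomics, currently the most commercially successful, is likewise built on an optimized droplet-based approach of this kind. The Human Cell Atlas (HCA) project has benefited from these technologies and has rapidly produced single-cell atlases of multiple human tissues and organs. This article briefly introduces the principle of the Drop-seq experiment and the key points of the Drop-seq_tools data analysis workflow.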

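The Drop-seq experiment
The core components of Drop-seq are a pressurizing device and a droplet-generating microfluidic device. Pressure drives three materials (barcoded beads, cells, and oil) into the microchannels, where a bead and a cell are combined in a single droplet. This isolates and labels individual cells, after which library construction proceeds.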


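Drop-seq_tools analysis workflow
Single-cell data produced by different experimental protocols are analyzed in broadly the same way: filter and quality-check the raw data, align the reads to the genome, and then use the UMIs and cell barcodes to quantify gene expression in each cell.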

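Drop-seq_tools works together with Picard tools, samtools, STAR, and similar tools to process FASTQ, BAM, and SAM files. The workflow first builds a reference genome folder by processing the species' reference genome FASTA and GTF files. The folder holds the following files: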
● fasta: The reference sequence of the organism. Needed for most aligners.
● dict: A dictionary file as generated by Picard’s CreateSequenceDictionary. Needed for Picard Tools.
● gtf: The principal file to determine the location of genomic features like genes, transcripts, and exons. Many other metadata files we use derive from this original file. We download our GTF files from Ensembl, which provides a description of the file format and a large number of prepared GTF files for a variety of organisms.
● refFlat: This file contains a subset of the same information in the GTF file in a different format. Picard tools like the refFlat format, so we require this as well. To make life easy, we provide a program ConvertToRefFlat that can convert files from GTF format to refFlat for you.
● genes.intervals: The genes from the GTF file in interval list format. This file is optional, and useful if you want to go back to your BAM later to see what gene(s) a read aligns to.
● exons.intervals: The exons from the GTF file in interval list format. This file is optional, and useful if you want to go back to your BAM and view what exon(s) a read aligns to.
● rRNA.intervals: The locations of ribosomal RNA in interval list format. This file is optional, but we find it useful to later assess how much of a dropseq library aligns to rRNA.
● reduced.gtf: This file contains a subset of the information in the GTF file, but in a far more human readable format. This file is optional, but can be generated easily by the supplied ReduceGTF program that will take a GTF file as input.
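To make the relationship between the GTF and the derived files more concrete, here is a minimal Python sketch that pulls the coordinates and gene identifiers out of a single GTF record. It is not how ConvertToRefFlat or ReduceGTF are implemented; the function name `parse_gtf_line` and the sample record (with its made-up gene ID and name) are purely hypothetical, and the attribute parsing assumes values contain no embedded semicolons.

```python
def parse_gtf_line(line):
    """Split one GTF record into its fixed columns plus selected attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GTF record has exactly 9 tab-separated columns")
    chrom, source, feature, start, end, score, strand, frame, attr_text = cols

    # The 9th column is a list of 'key "value";' pairs.
    attributes = {}
    for item in attr_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "chrom": chrom,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "gene_id": attributes.get("gene_id"),
        "gene_name": attributes.get("gene_name"),
    }


# Hypothetical record, for illustration only.
example = "\t".join([
    "1", "demo_source", "exon", "100", "250", ".", "+", ".",
    'gene_id "GENE0001"; gene_name "DemoGene"; transcript_id "TX0001";',
])
record = parse_gtf_line(example)
print(record)
```

A reduced, per-feature record like this is roughly the kind of information that the interval and refFlat files reorganize for downstream tools.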
Next, Picard FastqToSam converts the raw sequencer output (FASTQ) to SAM/BAM.
The Drop-seq tools can then process the BAM file further, including the following steps:
Example Cell Barcode:
TagBamWithReadSequenceExtended
INPUT=my_unaligned_data.bam
OUTPUT=unaligned_tagged_Cell.bam
SUMMARY=unaligned_tagged_Cellular.bam_summary.txt
BASE_RANGE=1-12
BASE_QUALITY=10
BARCODED_READ=1
DISCARD_READ=False
TAG_NAME=XC
NUM_BASES_BELOW_QUALITY=1
Example Molecular Barcode:
TagBamWithReadSequenceExtended
INPUT=unaligned_tagged_Cell.bam
OUTPUT=unaligned_tagged_CellMolecular.bam
SUMMARY=unaligned_tagged_Molecular.bam_summary.txt
BASE_RANGE=13-20
BASE_QUALITY=10
BARCODED_READ=1
DISCARD_READ=True
TAG_NAME=XM
NUM_BASES_BELOW_QUALITY=1
FilterBAM
TAG_REJECT=XQ
INPUT=unaligned_tagged_CellMolecular.bam
OUTPUT=unaligned_tagged_filtered.bam
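The tagging and filtering steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Drop-seq_tools code: it takes bases 1-12 of the barcode read as the cell barcode (XC) and bases 13-20 as the molecular barcode (XM), counts barcode bases below the quality threshold, and marks the read with XQ when that count reaches the limit. The function names and the exact threshold semantics ("at least" `num_bases_below_quality` low-quality bases) are assumptions.

```python
def extract_barcodes(seq, quals, cell_range=(1, 12), umi_range=(13, 20),
                     base_quality=10, num_bases_below_quality=1):
    """Return XC/XM tags for one barcode read, plus XQ if quality is poor.

    seq   -- read 1 sequence as a string
    quals -- per-base Phred scores as a list of ints
    Ranges are 1-based and inclusive, mirroring BASE_RANGE.
    """
    def cut(rng, values):
        start, end = rng
        return values[start - 1:end]

    tags = {"XC": cut(cell_range, seq), "XM": cut(umi_range, seq)}

    # Count low-quality bases across both barcode ranges.
    low = sum(q < base_quality
              for rng in (cell_range, umi_range)
              for q in cut(rng, quals))
    if low >= num_bases_below_quality:  # assumed semantics
        tags["XQ"] = low
    return tags


def filter_reads(tagged_reads, tag_reject="XQ"):
    """Drop every read carrying the rejected tag (the role of TAG_REJECT)."""
    return [tags for tags in tagged_reads if tag_reject not in tags]


seq = "ACGTACGTACGT" + "TTGGCCAA"    # 12-base cell barcode + 8-base UMI
good = extract_barcodes(seq, [30] * 20)
bad = extract_barcodes(seq, [30] * 5 + [5] + [30] * 14)
kept = filter_reads([good, bad])
print(good, bad, len(kept))
```

Running the tagger twice in the real workflow (once per barcode) achieves the same result as the single pass used here for brevity.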
TrimStartingSequence
TrimStartingSequence
INPUT=unaligned_tagged_filtered.bam
OUTPUT=unaligned_tagged_trimmed_smart.bam
OUTPUT_SUMMARY=adapter_trimming_report.txt
SEQUENCE=AAGCAGTGGTATCAACGCAGAGTGAATGGG
MISMATCHES=0
NUM_BASES=5
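As a rough picture of what this trimming step does, the Python sketch below clips SMART adapter remnants from the 5' end of a read. It assumes, as a simplification, that the read begins with a suffix of the adapter at least `num_bases` long and that no mismatches are allowed (MISMATCHES=0); the real tool's matching logic may differ, and the function name is hypothetical.

```python
SMART_ADAPTER = "AAGCAGTGGTATCAACGCAGAGTGAATGGG"


def trim_starting_sequence(read, adapter=SMART_ADAPTER, num_bases=5):
    """Clip the longest adapter suffix (>= num_bases) found at the read start.

    Exact matching only; returns the read unchanged if nothing matches.
    """
    longest = min(len(adapter), len(read))
    for length in range(longest, num_bases - 1, -1):
        if read.startswith(adapter[-length:]):
            return read[length:]
    return read


# The first read starts with the last 7 adapter bases; the second has none.
trimmed = trim_starting_sequence("GAATGGG" + "CTTACGGATC")
untouched = trim_starting_sequence("CTTACGGATC")
print(trimmed, untouched)
```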
PolyATrimmer
PolyATrimmer
INPUT=unaligned_tagged_trimmed_smart.bam
OUTPUT=unaligned_mc_tagged_polyA_filtered.bam
OUTPUT_SUMMARY=polyA_trimming_report.txt
MISMATCHES=0
NUM_BASES=6
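The polyA step can be pictured in the same spirit. The sketch below is a simplified assumption about the behavior, not the PolyATrimmer implementation: it looks for the first run of at least `num_bases` consecutive A's (no mismatches) and removes that run together with everything 3' of it.

```python
def trim_poly_a(read, num_bases=6):
    """Clip the read at the first run of num_bases consecutive A's."""
    idx = read.find("A" * num_bases)
    return read if idx == -1 else read[:idx]


clipped = trim_poly_a("GATTACAGGCT" + "AAAAAAAAAA" + "CGT")
unchanged = trim_poly_a("GATTACAGGCT")
print(clipped, unchanged)
```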
SamToFastq
java -Xmx4g -jar /path/to/picard/picard.jar SamToFastq
INPUT=unaligned_mc_tagged_polyA_filtered.bam
FASTQ=unaligned_mc_tagged_polyA_filtered.fastq
The steps above complete the pre-alignment processing; alignment comes next.
STAR
/path/to/STAR/STAR
--genomeDir /path/to/STAR_REFERENCE
--readFilesIn unaligned_mc_tagged_polyA_filtered.fastq
--outFileNamePrefix star
SortSam
java -Xmx4g -jar /path/to/picard/picard.jar SortSam
I=starAligned.out.sam
O=aligned.sorted.bam
SO=queryname
MergeBamAlignment
java -Xmx4g -jar /path/to/picard/picard.jar MergeBamAlignment
REFERENCE_SEQUENCE=my_fasta.fasta
UNMAPPED_BAM=unaligned_mc_tagged_polyA_filtered.bam
ALIGNED_BAM=aligned.sorted.bam
OUTPUT=merged.bam
INCLUDE_SECONDARY_ALIGNMENTS=false
PAIRED_RUN=false
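The reason for sorting by queryname and then merging is that the FASTQ sent to STAR no longer carries the XC/XM tags, so they have to be recovered from the unaligned BAM by read name. The Python sketch below illustrates that join on plain dictionaries. It is not Picard's algorithm (which streams two queryname-sorted files and also handles unmapped reads); the record layout and function name are hypothetical.

```python
def merge_alignments(unaligned, aligned):
    """Attach the tags of each unaligned read to its alignment, by read name."""
    tags_by_name = {read["name"]: read["tags"] for read in unaligned}
    merged = []
    for aln in aligned:
        tags = tags_by_name.get(aln["name"])
        if tags is None:
            continue  # alignment without a matching tagged read: skipped here
        merged.append({**aln, "tags": dict(tags)})
    return merged


unaligned = [
    {"name": "read1", "tags": {"XC": "ACGTACGTACGT", "XM": "TTGGCCAA"}},
    {"name": "read2", "tags": {"XC": "TTTTACGTACGT", "XM": "GGGGCCAA"}},
]
aligned = [{"name": "read1", "chrom": "1", "pos": 100}]
merged = merge_alignments(unaligned, aligned)
print(merged)
```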
TagReadWithGeneExon
TagReadWithGeneExon
I=merged.bam
O=star_gene_exon_tagged.bam
ANNOTATIONS_FILE=${refFlat}
TAG=GE
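Conceptually, gene tagging is an interval-overlap lookup between each aligned read and the annotated exons. The following Python sketch shows that lookup with a linear scan over a tiny, made-up exon table. It ignores strand and returns every overlapping gene, leaving the handling of ambiguous reads to the caller; both choices are simplifying assumptions rather than the behavior of TagReadWithGeneExon.

```python
def overlapping_genes(chrom, start, end, exons):
    """Return sorted gene names whose exons overlap [start, end] on chrom.

    exons -- iterable of (chrom, exon_start, exon_end, gene) tuples,
             with 1-based inclusive coordinates.
    """
    return sorted({
        gene
        for exon_chrom, exon_start, exon_end, gene in exons
        if exon_chrom == chrom and exon_start <= end and start <= exon_end
    })


# Hypothetical annotation, for illustration only.
exons = [
    ("1", 100, 200, "GeneA"),
    ("1", 150, 300, "GeneB"),
    ("2", 100, 200, "GeneC"),
]
single = overlapping_genes("1", 120, 140, exons)
ambiguous = overlapping_genes("1", 160, 180, exons)
intergenic = overlapping_genes("1", 400, 450, exons)
print(single, ambiguous, intergenic)
```

A production implementation would use an indexed interval structure instead of scanning every exon for every read.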
DetectBeadSynthesisErrors: detecting and repairing barcode synthesis errors
DetectBeadSynthesisErrors
I=my.bam
O=my_clean.bam
OUTPUT_STATS=my.synthesis_stats.txt
SUMMARY=my.synthesis_stats.summary.txt
NUM_BARCODES= <roughly 2x the number of cells>
PRIMER_SEQUENCE=AAGCAGTGGTATCAACGCAGAGTAC
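One way to picture synthesis-error detection is as a base-composition check on the UMIs observed for each cell barcode: if one UMI position shows almost the same base in every read (for example a T at the last position), the barcode is suspicious. The Python sketch below implements only that check. The 0.8 cutoff, the function name, and the example UMIs are assumptions for illustration, and the repair logic of the real tool is not reproduced.

```python
from collections import Counter


def fixed_base_positions(umis, min_fraction=0.8):
    """Return {position: base} for UMI positions dominated by a single base.

    umis -- UMI strings of equal length, all from one cell barcode.
    """
    flagged = {}
    if not umis:
        return flagged
    for pos in range(len(umis[0])):
        counts = Counter(umi[pos] for umi in umis)
        base, n = counts.most_common(1)[0]
        if n / len(umis) >= min_fraction:
            flagged[pos] = base
    return flagged


# Made-up UMIs for one cell barcode: every UMI ends in T.
umis = ["ACGTACGT", "TTGCAAGT", "GGCATCAT", "CATGGTCT", "TGACCGTT"]
flagged = fixed_base_positions(umis)
print(flagged)
```

With real data the check would only be meaningful for barcodes with enough UMIs to estimate the base composition reliably.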
Digital Gene Expression
DigitalExpression
I=out_gene_exon_tagged.bam
O=out_gene_exon_tagged.dge.txt.gz
SUMMARY=out_gene_exon_tagged.dge.summary.txt
NUM_CORE_BARCODES=100
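The core of digital expression is counting distinct UMIs per gene per cell, so that PCR duplicates of the same molecule are counted once. A minimal Python sketch of that counting follows. It works on plain (cell barcode, gene, UMI) tuples rather than a BAM file and does not merge UMIs that differ by a sequencing error, so it should be read as an illustration of the principle only; the function name and data are hypothetical.

```python
from collections import defaultdict


def digital_expression(reads):
    """Count distinct UMIs for every (cell barcode, gene) pair.

    reads -- iterable of (cell_barcode, gene, umi) tuples.
    """
    umis = defaultdict(set)
    for cell, gene, umi in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(values) for key, values in umis.items()}


reads = [
    ("CELL1", "GeneA", "AAAACCCC"),
    ("CELL1", "GeneA", "AAAACCCC"),  # duplicate of the same molecule
    ("CELL1", "GeneA", "GGGGTTTT"),
    ("CELL1", "GeneB", "AAAACCCC"),
    ("CELL2", "GeneA", "AAAACCCC"),
]
dge = digital_expression(reads)
print(dge)
```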
Cell Selection
BAMTagHistogram
I=out_gene_exon_tagged.bam
O=out_cell_readcounts.txt.gz
TAG=XC
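The read counts per cell barcode are typically used to decide how many barcodes correspond to real cells: sort the barcodes by read count, accumulate the fraction of reads, and look for the point where the curve flattens. The Python sketch below computes that cumulative curve from a list of XC values. Choosing the cutoff itself is left to the analyst, and the function name and toy barcodes are hypothetical.

```python
from collections import Counter


def cumulative_read_fraction(cell_barcodes):
    """Return (barcode, reads, cumulative fraction) rows, largest cells first."""
    counts = Counter(cell_barcodes)
    total = sum(counts.values())
    rows = []
    running = 0
    for barcode, n in counts.most_common():  # sorted by descending read count
        running += n
        rows.append((barcode, n, running / total))
    return rows


# Toy input: one XC value per read.
barcodes = ["AAA"] * 6 + ["CCC"] * 3 + ["GGG"]
rows = cumulative_read_fraction(barcodes)
for row in rows:
    print(row)
```

Plotting the third column against the barcode rank gives the curve whose inflection point is commonly used as the cell-number estimate.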
The resulting data can then be loaded into Seurat for downstream filtering and clustering.
References:
https://github.com/broadinstitute/Drop-seq
http://mccarrolllab.org/dropseq/