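--- Our lab studies snow lotus. Can we do 10X single-cell transcriptomics?
··· Yes.

--- Our lab assembled a snow lotus genome a few years ago. It was never published, a senior labmate did the work, and we don't know how good it is. Can we do 10X single-cell transcriptomics?
··· Yes.

--- Our lab has a snow lotus third-generation (long-read) transcriptome and a genome. Can we do 10X single-cell transcriptomics based on that?
··· Yes.

So the genome is part of the basic infrastructure of a life science lab, and in the near future single-cell data will be too.

To answer the questions above, the first thing to understand is: what is a genome?

A genome mainly comes as two files:

The fa sequence file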

>15 dna:chromosome chromosome:GRCh38:15:1:101991189:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

The gtf annotation file

#!genome-build GRCh38.p12
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.27
#!genebuild-last-updated 2018-01
1       havana  gene    29554   31109   .       +       .       gene_id "ENSG00000243485"; gene_version "5"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"
1       havana  transcript      29554   31097   .       +       .       gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-
1       havana  exon    29554   30039   .       +       .       gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name 
1       havana  exon    30564   30667   .       +       .       gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "2"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name 
1       havana  exon    30976   31097   .       +       .       gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "3"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name 
1       havana  transcript      30267   31109   .       +       .       gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000469289"; transcript_ve

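Assembly

The sequence file stores the genome sequence in fa (FASTA) format. Here we can see that in GRCh38 many Ns have been added at both ends of the chromosome.
What can we learn from the sequence file?

Assembly level: chromosome, contig, or scaffold?
Assembly quality assessment:

In de novo genome projects for animals and plants, the assembly metrics directly determine the quality of the whole genome. Contig N50 and scaffold N50 are the first metrics to look at. To compute contig/scaffold N50, sort the contigs/scaffolds from longest to shortest and add up their lengths; when the running sum reaches 50% of the total contig/scaffold length, the length of the last contig/scaffold added is the contig/scaffold N50. In general, the longer the contig/scaffold N50, the better the assembly.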
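The N50 definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the contig or scaffold lengths have already been extracted from the fa file; the function name `n50` and the toy lengths are made up for the example.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # The sequence that brings the running sum to 50% is the N50.
            return length
    return ordered[-1]  # not reached when all lengths are positive


# Toy lengths, invented for illustration only (total 300, half 150).
print(n50([80, 70, 50, 40, 30, 20, 10]))  # prints 70
```

Here 80 + 70 already reaches half of the total length, so the N50 is 70 even though most of the sequences are shorter than that.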

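But does a high N50 guarantee a reliable assembly?

Not necessarily! Wrongly joining unrelated reads or contigs into scaffolds can also produce a very high scaffold N50.

For publication in high-profile journals, assembly metrics matter, but what really decides where a paper lands is whether the biological story is complete and has highlights. All downstream analysis rests on the assembled genome, so an unreliable assembly causes serious trouble downstream and can even lead to wrong biological conclusions.

So how can we check whether an assembly is reliable?

1. Sequence consistency assessment:

The genome is assembled from reads. In this step the reads are mapped back to the genome to check how well they cover it, which assesses both the completeness of the assembly and the uniformity of the sequencing. A high mapping rate (above 90%) and coverage (above 95%) indicate that the assembly agrees well with the reads.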
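As a rough sketch of how these two numbers are computed, the Python below assumes the read counts and per-base depths have already been obtained from an alignment (for example with tools such as `samtools flagstat` and `samtools depth`). The function names and the `min_depth` parameter are illustrative assumptions, and the thresholds are simply the 90% and 95% figures quoted above.

```python
from typing import Sequence


def mapping_rate(mapped_reads: int, total_reads: int) -> float:
    """Fraction of reads that aligned to the assembly."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def coverage(depths: Sequence[int], min_depth: int = 1) -> float:
    """Fraction of assembly positions covered by at least min_depth reads."""
    if not depths:
        raise ValueError("no depth values given")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)


def looks_consistent(rate: float, cov: float,
                     min_rate: float = 0.90, min_cov: float = 0.95) -> bool:
    """Apply the rule-of-thumb thresholds from the text."""
    return rate >= min_rate and cov >= min_cov


# Invented numbers, for illustration only.
rate = mapping_rate(mapped_reads=950, total_reads=1000)
cov = coverage([0, 3, 5, 7, 9, 4, 6, 8, 2, 1] * 10)
print(rate, cov, looks_consistent(rate, cov))
```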

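2. Sequence completeness assessment:

Completeness assessment measures how well the assembled genome covers gene regions. It generally relies on RNA evidence such as EST data or RNA reads, and the resulting percentages vary with the evidence used. As a rule of thumb, the assembly is considered fairly complete when more than 95% of genes have at least 50% of their length covered by a scaffold and more than 90% of genes have at least 85% covered.

3. Accuracy assessment:

Full-length BAC sequences can be aligned to the assembly to verify its correctness; assembly quality is judged by how well the BAC sequences and the scaffolds agree.

4. Conserved gene assessment:

The assembly is evaluated against a set of conserved protein families found across a wide range of eukaryotes (the library of 248 core genes), assessing the accuracy and completeness of the core genes in the assembled genome. Comparing the cegma percentage of this species with that of related species shows how well the conserved genes were assembled.

Is there a ready-made method for this?

Yes. LAI (LTR Assembly Index): a standard for assessing genome quality

LAI values are classified according to the following criteria: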

Category	LAI	Examples
Draft	0 ≤ LAI < 10	Apple (v1.0), Cacao (v1.0)
Reference	10 ≤ LAI < 20	Arabidopsis (TAIR10), Grape (12X)
Gold	20 ≤ LAI	Rice (MSUv7), Maize (B73 v4)
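The classification table can be written as a small lookup. This is only a sketch of the thresholds listed above; the function name `lai_category` is hypothetical, and computing the LAI value itself is outside its scope.

```python
def lai_category(lai: float) -> str:
    """Map an LAI value to the category used in the table above."""
    if lai < 0:
        raise ValueError("LAI cannot be negative")
    if lai < 10:
        return "Draft"
    if lai < 20:
        return "Reference"
    return "Gold"


for value in (4.2, 14.9, 23.0):  # invented values, for illustration only
    print(value, lai_category(value))
```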

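Annotation

Annotation uses positional information to state what each stretch of the genome sequence does (a kind of description).

So how is a genome sequence annotated? Once the assembly, or even just a draft, is finished, annotating the sequence is unavoidable. Before annotation, gene models must be built, and there are three strategies:

De novo prediction: predicts gene structure with existing probabilistic models; less accurate for splice sites and UTRs.
Homology-based prediction: some genes and proteins are highly conserved between closely related species, so high-quality annotations from a related species can be used, via sequence alignment, to determine exon boundaries and splice sites.
Transcriptome-based prediction: uses the species' RNA-seq data to assist annotation, and can determine splice sites and exon regions fairly accurately.

In the era of high-throughput sequencing, obtaining a genome sequence is no longer the hard part, but annotating each stretch of sequence still takes, and deserves, some effort.

Is my genome ready for 10X single-cell transcriptomics?

With a basic understanding of the genome, we can now answer this question.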

Cell Ranger uses an aligner called STAR, which performs splicing-aware alignment of reads to the genome. Cell Ranger then uses the transcript annotation GTF to bucket the reads into exonic, intronic, and intergenic, and by whether the reads align (confidently) to the genome. A read is exonic if at least 50% of it intersects an exon, intronic if it is non-exonic and intersects an intron, and intergenic otherwise.
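The bucketing rule in the paragraph above can be illustrated with a toy Python function. This is not Cell Ranger's code: it assumes a single unspliced alignment, half-open coordinates on one chromosome, exons that do not overlap each other, and it ignores strand and mapping confidence. All names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def classify_read(read: Interval, exons: List[Interval],
                  introns: List[Interval]) -> str:
    read_length = read[1] - read[0]
    if read_length <= 0:
        raise ValueError("read interval must have positive length")
    # Assumption: exons do not overlap, so summing does not double count.
    exonic_bases = sum(overlap(read, exon) for exon in exons)
    if exonic_bases * 2 >= read_length:  # at least 50% of the read
        return "exonic"
    if any(overlap(read, intron) > 0 for intron in introns):
        return "intronic"
    return "intergenic"


# Invented gene model: two exons separated by one intron.
exons = [(100, 200), (300, 400)]
introns = [(200, 300)]
for read in [(150, 250), (180, 280), (500, 600)]:
    print(read, classify_read(read, exons, introns))
```

The three example reads fall into the exonic, intronic and intergenic buckets respectively, which shows why a gtf without exon rows leaves Cell Ranger with nothing to count against.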

Basic annotation information:

Column	Name	Description
1	Chromosome	Must refer to a chromosome/contig in the genome fasta.
2	Source	Unused.
3	Feature	`cellranger count` only uses rows where this field is `exon`.
4	Start	Start position on the reference (1-based inclusive).
5	End	End position on the reference (1-based inclusive).
6	Score	Unused.
7	Strand	Strandedness of this feature on the reference: + or -.
8	Frame	Unused.
9	Attributes	A semicolon-delimited list of key-value pairs of the form key "value". The attribute keys `transcript_id` and `gene_id` are required; `gene_name` is optional and may be non-unique, but if present will be preferentially displayed in reports.

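In other words, the annotation must contain exon rows that carry transcript_id and gene_id; this is the minimum that 10X single-cell transcriptomics requires of a genome. A chromosome-level assembly is of course better, but it is not required.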
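Before building a reference, it can help to check these minimum requirements yourself. The Python below is a hedged sketch, not part of Cell Ranger: `fasta_names` and `check_gtf` are hypothetical helpers that verify exon rows exist, that their chromosome names appear in the fa file, and that `gene_id` and `transcript_id` are present.

```python
import re
from typing import Iterable, List, Set

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'([A-Za-z_]\w*)\s+"([^"]*)"')


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect sequence names: the first token after '>' on header lines."""
    return {line[1:].split()[0] for line in lines
            if line.startswith(">") and line[1:].split()}


def check_gtf(lines: Iterable[str], names: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []
    exon_rows = 0
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(f"line {number}: expected 9 columns, got {len(fields)}")
            continue
        chrom, feature, strand, attributes = fields[0], fields[2], fields[6], fields[8]
        if feature != "exon":
            continue  # only exon rows are checked in this sketch
        exon_rows += 1
        if chrom not in names:
            problems.append(f"line {number}: '{chrom}' is not in the fasta")
        if strand not in {"+", "-"}:
            problems.append(f"line {number}: strand must be + or -")
        keys = {key for key, _ in ATTR_RE.findall(attributes)}
        for required in ("gene_id", "transcript_id"):
            if required not in keys:
                problems.append(f"line {number}: missing {required}")
    if exon_rows == 0:
        problems.append("no rows with feature type 'exon'")
    return problems


# Invented two-line example, for illustration only.
fa = [">15 dna:chromosome chromosome:GRCh38:15:1:101991189:1 REF", "NNNN"]
gtf = ['15\thavana\texon\t1\t4\t.\t+\t.\tgene_id "G1"; transcript_id "T1";']
print(check_gtf(gtf, fasta_names(fa)))
```

An empty list from `check_gtf` means only that these basic checks passed; it says nothing about the biological quality of the annotation.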

With the fa and gtf files in hand, we can use the cellranger mkref pipeline to build a 10X-specific reference genome:

cellranger mkref --genome=output_genome --fasta=input.fa --genes=input.gtf

Once built, it looks like this:

output_genome/
├── fasta
│   └── genome.fa
├── genes
│   └── genes.gtf
├── pickle
│   └── genes.pickle
├── reference.json
└── star # STAR genome index folder

For the genome sequence, include all major chromosomes, unplaced and unlocalized scaffolds, but do not include patches and alternative haplotypes.
- In Ensembl, the recommended genome file to download is annotated as "primary assembly."
- In NCBI, it is "no alternative - analysis set."
For the GTF file, genes must be annotated with feature type 'exon' (column 3).
- Prior to mkref, GTF annotation files from Ensembl and NCBI are typically filtered with mkgtf to include only a subset of the annotated gene biotypes.
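To make the idea of biotype filtering concrete, here is a rough Python sketch of that kind of filter. It is not mkgtf itself and makes no claim about how mkgtf works internally; it assumes an Ensembl-style `gene_biotype` attribute, and the function name `filter_by_biotype` is hypothetical.

```python
import re
from typing import Iterable, Iterator, Set

BIOTYPE_RE = re.compile(r'gene_biotype\s+"([^"]*)"')


def filter_by_biotype(lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield header comments plus rows whose gene_biotype is in keep."""
    for line in lines:
        if line.startswith("#"):
            yield line  # header comments pass through unchanged
            continue
        match = BIOTYPE_RE.search(line)
        if match and match.group(1) in keep:
            yield line


# Invented rows, for illustration only.
rows = [
    '#!genome-build X',
    '1\thavana\texon\t1\t5\t.\t+\t.\tgene_id "A"; gene_biotype "protein_coding";',
    '1\thavana\texon\t1\t5\t.\t+\t.\tgene_id "B"; gene_biotype "lincRNA";',
]
for row in filter_by_biotype(rows, {"protein_coding"}):
    print(row)
```

For a newly annotated non-model genome such as snow lotus, the gtf may carry no biotype attribute at all, in which case this kind of filtering simply does not apply.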

Creating a Reference Package with cellranger mkref