2. Nanopore三代測(cè)序fastq基因組比對(duì)

因?yàn)轭A(yù)處理會(huì)過濾掉90%以上的reads,所以分別用原始數(shù)據(jù)和過濾后的數(shù)據(jù)進(jìn)行基因組比對(duì),看是否有差異。

一、先用未經(jīng)過預(yù)處理的fastq文件進(jìn)行比對(duì)(Test)

  1. 查看文件里有多少讀段
cat BC07.fastq|grep "^@"|wc -l
818523
cat BC08.fastq|grep "^@"|wc -l
502760
  1. 建立比對(duì)數(shù)據(jù)庫(kù)
    (1)準(zhǔn)備基因序列和基因注釋文件
gunzip Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz
gunzip gencode.vM36.annotation.gtf.gz

(2)Hisat2比對(duì)RNA-seq數(shù)據(jù)

師兄說Hisat2并不是一個(gè)很好用的比對(duì)工具,對(duì)于三代測(cè)序的長(zhǎng)reads,minimap2才是更好的比對(duì)工具。

  • 安裝Hisat2軟件
conda install -c bioconda hisat2
  • 使用Hisat2構(gòu)建索引
hisat2-build -p 8 Mus_musculus.GRCm39.dna_sm.primary_assembly.fa GRCm39_index

GRCm39_index → 生成的索引前綴(將創(chuàng)建多個(gè)索引文件)

  • Hisat2比對(duì)單端reads
hisat2 -p 8 -x ../../../genome/mouse_genome/GRCm39_index -U BC07.fastq -S BC07.sam
802807 reads; of these:
  802807 (100.00%) were unpaired; of these:
    802224 (99.93%) aligned 0 times
    462 (0.06%) aligned exactly 1 time
    121 (0.02%) aligned >1 times
0.07% overall alignment rate 

hisat2 -p 8 -x ../../../genome/mouse_genome/GRCm39_index -U BC08.fastq -S BC08.sam
479463 reads; of these:
  479463 (100.00%) were unpaired; of these:
    479102 (99.92%) aligned 0 times
    348 (0.07%) aligned exactly 1 time
    13 (0.00%) aligned >1 times
0.08% overall alignment rate 
#比對(duì)效率極低
  1. 比對(duì)結(jié)果分析(Samtools)
    (1)samtools生成bam二進(jìn)制文件,并建立索引
samtools view -@ 8 -bS BC07.sam | samtools sort -@ 8 -o BC07_sorted.bam
samtools index BC07_sorted.bam

samtools view -@ 8 -bS BC08.sam | samtools sort -@ 8 -o BC08_sorted.bam
samtools index BC08_sorted.bam

(2)通過samtools輸出比對(duì)的統(tǒng)計(jì)信息

samtools flagstat BC07_sorted.bam                                
802965 + 0 in total (QC-passed reads + QC-failed reads)
802807 + 0 primary
158 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
741 + 0 mapped (0.09% : N/A)
583 + 0 primary mapped (0.07% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat BC08_sorted.bam
479481 + 0 in total (QC-passed reads + QC-failed reads)
479463 + 0 primary
18 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
379 + 0 mapped (0.08% : N/A)
361 + 0 primary mapped (0.08% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
#為什么比對(duì)效率這么低呢?
  1. IGV可視化
    (1)打開IGV軟件,選擇Mouse基因組


    IGV

    (2)導(dǎo)入:File>Load from file...>.bam文件(.bam.bai要和.bam文件在同一目錄下)

  • 因?yàn)楸葘?duì)上的序列數(shù)量太少了,用samtools查看比對(duì)序列在染色體上的位置
samtools idxstats BC07_sorted.bam
1       195154279       30      0
10      130530862       15      0
11      121973369       29      0
12      120092757       210     0
13      120883175       11      0
14      125139656       14      0
15      104073951       20      0
16      98008968        23      0
17      95294699        122     0
18      90720763        9       0
19      61420004        6       0
2       181755017       26      0
3       159745316       12      0
4       156860686       25      0
5       151758149       16      0
6       149588044       17      0
7       144995196       26      0
8       130127694       25      0
9       124359700       34      0
MT      16299   45      0
X       169476592       8       0
Y       91455967        0       0

samtools view BC07_sorted.bam | awk '$3=="12" {print $3, $4}' | sort | uniq -c | sort -nr | head -n 10
     77 12 69206069
     13 12 69408101
      9 12 69408102
      7 12 69408093
      4 12 69408136
      4 12 69206071
      3 12 69408139
      3 12 69408134
      3 12 69408111
      3 12 69408108
  • 查看第12號(hào)染色體上比對(duì)的位置,把位置復(fù)制到IGV中即可查看


所以IGV可視化可以看到比對(duì)上的都是短片段,不是長(zhǎng)片段,這可能就和Hisat2的工具的不適合有關(guān)系。

二、用經(jīng)過預(yù)處理的fastq文件進(jìn)行比對(duì)(Test)

預(yù)處理fastq來源于“1. Nanopore三代測(cè)序fastq數(shù)據(jù)預(yù)處理”中的*output.fastq文件

  1. 查看預(yù)處理文件的reads數(shù)
cat BC07_output.fastq |grep "^@"|wc -l
105907

cat BC08_output.fastq |grep "^@"|wc -l
33509
  1. 進(jìn)一步去掉可能的rRNA污染(可省略)
  • 安裝SortMeRNA
conda install -c bioconda sortmerna

mkdir -p ~/sortmerna_db && cd ~/sortmerna_db
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/silva-bac-16s-id90.fasta
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/silva-bac-23s-id98.fasta
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/silva-euk-18s-id95.fasta
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/silva-euk-28s-id98.fasta
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/rfam-5s-database-id98.fasta
wget -c https://github.com/biocore/sortmerna/raw/master/data/rRNA_databases/rfam-5.8s-database-id98.fasta
sortmerna --ref silva-bac-16s-id90.fasta \
          --ref silva-bac-23s-id98.fasta \
          --ref silva-euk-18s-id95.fasta \
          --ref silva-euk-28s-id98.fasta \
          --ref rfam-5s-database-id98.fasta \
          --ref rfam-5.8s-database-id98.fasta \
          --reads ../nanopore-data/FMR1/barcode-241126/BC08_output.fastq \
          --aligned ../nanopore-data/FMR1/barcode-241126/BC08_output_rRNA \
          --other ../nanopore-data/FMR1/barcode-241126/BC08_output_non_rRNA \
          --fastx

sortmerna --ref silva-bac-16s-id90.fasta \
          --ref silva-bac-23s-id98.fasta \
          --ref silva-euk-18s-id95.fasta \
          --ref silva-euk-28s-id98.fasta \
          --ref rfam-5s-database-id98.fasta \
          --ref rfam-5.8s-database-id98.fasta \
          --reads ../nanopore-data/FMR1/barcode-241126/BC07_output.fastq \
          --aligned ../nanopore-data/FMR1/barcode-241126/BC07_output_rRNA \
          --other ../nanopore-data/FMR1/barcode-241126/BC07_output_non_rRNA \
          --fastx
  • 再次檢查處理后的reads數(shù)
cat BC07_output_non_rRNA.fq |grep "^@"|wc -l
68184

cat BC08_output_non_rRNA.fq |grep "^@"|wc -l
21547

BC08去掉了60%的rRNA,BC07去掉了20%的rRNA,但是BC07是input樣品,可能BC08就是富集了rRNA的mRNA,所以考慮不用去除rRNA基因組的fastq文件,直接使用*output.fastq文件。

  1. Hisat2比對(duì)RNA-seq數(shù)據(jù)
hisat2 -p 8 -x ../../../genome/mouse_genome/GRCm39_index -U BC08_output.fastq -S BC08_output.sam
32834 reads; of these:
  32834 (100.00%) were unpaired; of these:
    32682 (99.54%) aligned 0 times
    144 (0.44%) aligned exactly 1 time
    8 (0.02%) aligned >1 times
0.46% overall alignment rate

hisat2 -p 8 -x ../../../genome/mouse_genome/GRCm39_index -U BC07_output.fastq -S BC07_output.sam
105179 reads; of these:
  105179 (100.00%) were unpaired; of these:
    104630 (99.48%) aligned 0 times
    350 (0.33%) aligned exactly 1 time
    199 (0.19%) aligned >1 times
0.52% overall alignment rate
  • 預(yù)處理和不預(yù)處理基因組比對(duì)的reads比較:
  1. samtools生成bam二進(jìn)制文件,并建立索引
samtools view -@ 8 -bS BC07_output.sam | samtools sort -@ 8 -o BC07_output_sorted.bam          
samtools index BC07_output_sorted.bam

samtools view -@ 8 -bS BC08_output.sam | samtools sort -@ 8 -o BC08_output_sorted.bam
samtools index BC08_output_sorted.bam
  1. IGV可視化(同前)

主要問題:

  • 經(jīng)過質(zhì)量控制后能比對(duì)上的reads比例上升了,但是總數(shù)下降了
  • 不知道是不是RIP建庫(kù)的問題,基因組比對(duì)的方法能比對(duì)上的reads數(shù)太太太太少了
  • 按同樣的流程把BC07和BC08的重復(fù)數(shù)據(jù)再跑一遍pipeline。
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容