First, getting familiar with the data
This is human whole-exome sequencing data; to increase depth, the sample was sequenced on two lanes.
drwxr-xr-x 4 huangsiyuan grp3 220 Oct 17 09:18 CL100072545_L01_44/
drwxr-xr-x 4 huangsiyuan grp3 220 Oct 11 14:56 CL100072545_L02_44/
huangsiyuan 21:21:01 ~/learn_wes/data_ren
$ lsx CL100072545_L01_44_1.fq.gz
@CL100072545L1C001R001_6/1
TTTTTCTGTGAATGTTTCTTTTCCCAGCTTCCCTGAAAGCAACCATGGCT
+
BF@DEF@GFB::9EFDGEFFAE?D;CC=FEFFEF6@9:1C:FCFEFC:AF
--------------------------------------------------
$ lsx CL100072545_L01_44_2.fq.gz
@CL100072545L1C001R001_6/2
CTCTGGGATGATTGGAATTGATCCTGTAGCTGTTTTCCGATGGGCAATTC
+
>F>FB9FCFF6DCF=EBFFFEFFDFFFEFFGFCGFFD?8FEFEDEFEFFF
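The reads in the fq1 and fq2 files correspond one-to-one: each pair of records holds the two ends of one paired-end fragment. The read length in this run is 50 bp.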
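The pairing claim is easy to check programmatically. Below is a minimal Python sketch that parses the two records shown above (pasted in as strings rather than read from the .fq.gz files) and confirms that the read IDs match once the /1 and /2 suffixes are removed and that both reads are 50 bp long. The helper names parse_fastq and mate_id are made up for this illustration.

```python
# Minimal sketch: verify that fq1/fq2 records pair up and measure read length.
# The records are the two shown above, pasted in as text; for real files one
# would iterate over gzip.open(path, "rt") instead.

FQ1 = """@CL100072545L1C001R001_6/1
TTTTTCTGTGAATGTTTCTTTTCCCAGCTTCCCTGAAAGCAACCATGGCT
+
BF@DEF@GFB::9EFDGEFFAE?D;CC=FEFFEF6@9:1C:FCFEFC:AF
"""

FQ2 = """@CL100072545L1C001R001_6/2
CTCTGGGATGATTGGAATTGATCCTGTAGCTGTTTTCCGATGGGCAATTC
+
>F>FB9FCFF6DCF=EBFFFEFFDFFFEFFGFCGFFD?8FEFEDEFEFFF
"""


def parse_fastq(text):
    """Yield (read_id, sequence, quality) from 4-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual


def mate_id(read_id):
    """Strip a trailing /1 or /2 so both mates share one identifier."""
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id


pairs = list(zip(parse_fastq(FQ1), parse_fastq(FQ2)))
for (id1, seq1, qual1), (id2, seq2, qual2) in pairs:
    status = "paired" if mate_id(id1) == mate_id(id2) else "MISMATCH"
    print(mate_id(id1), status, len(seq1), len(seq2))
```

On full files the same loop would be run over every record; a single mismatch means the two files are out of sync.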
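Quality control of the data
Until now I have always used Trimmomatic; this time I am switching to SOAPnuke. SOAPnuke is a FASTQ filtering tool developed in-house by BGI. Its main functions are adapter filtering, low-quality filtering, and filtering of reads with a high proportion of N. The basic filtering functions are gathered in the filter module, which is suitable for filtering most raw FASTQ data as it comes off the sequencer.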
$ git clone https://github.com/BGI-flexlab/SOAPnuke.git
$ cd SOAPnuke/
$ ls
ChangeLog COPYING Makefile Readme.md src/
$ make
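# This builds version 2.X; the example below still uses version 1.5.6
# Run ./SOAPnuke filter -h to see the help text
For detailed usage, see this post: FASTQ quality-control and filtering software: soapnuke, fastp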
~/learn_wes/soft/SOAPnuke1.5.6 filter -n 0.1 --qualRate 0.5 --lowQual 12 -Q 2 -E 35 -G \
-1 ~/learn_wes/data_ren/CL100072545_L01_44_1.fq.gz \
-2 ~/learn_wes/data_ren/CL100072545_L01_44_2.fq.gz \
-f AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA \
-r AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG \
-M 2 -o ~/learn_wes/data_ren/ \
-C L01_44_1.fq.gz \
-D L01_44_2.fq.gz
# In version 1.5.6, these parameters have the following meanings
-n, --nRate : <f> N rate threshold (default: [0.05])
-q, --qualRate : <f> low quality rate (default: [0.5])
-l, --lowQual : <i> low quality threshold (default: [5])
-Q, --qualSys : <i> quality system 1:illumina, 2:sanger (default: [ 1 ])
-E, --cutAdaptor: <i> cut sequence from adaptor index,unless performed -f/-r also in use
discard the read when the adaptor index of the read is less than INT
-G, --sanger : <b> set clean data qualtiy system to sanger (default: illumina)
-f, --adapter1 : <s> 3' adapter sequence of fq1 file
-r, --adapter2 : <s> 5' adapter sequence of fq2 file [only for PE reads]
-M, --misMatch : <i> the max mismatch number when match the adapter (default: [1])
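To make the -n, --lowQual and --qualRate thresholds more concrete, here is a rough Python sketch of the per-read checks they describe, using the values from the command above (0.1, 12, 0.5). It is not SOAPnuke's code. It assumes Phred+33 quality encoding (matching -Q 2), treats a base as low quality when its score is at or below the threshold, and drops a read when a fraction strictly exceeds its limit; the tool's real boundary handling and its paired-end logic may differ. The function names are invented for the example.

```python
# Rough sketch of the read-level checks implied by -n, --lowQual and --qualRate.
# Assumptions (not taken from SOAPnuke's source): Phred+33 encoding, "low
# quality" means score <= low_qual, and a read is dropped when a fraction
# strictly exceeds its threshold.


def phred33(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]


def keep_read(seq, qual, n_rate=0.1, low_qual=12, qual_rate=0.5):
    """Return (keep, reason) for a single read."""
    n_frac = seq.upper().count("N") / len(seq)
    if n_frac > n_rate:
        return False, "n_rate"
    scores = phred33(qual)
    low_frac = sum(q <= low_qual for q in scores) / len(scores)
    if low_frac > qual_rate:
        return False, "low_quality"
    return True, "pass"


# First read of fq1, as shown earlier in this post.
seq = "TTTTTCTGTGAATGTTTCTTTTCCCAGCTTCCCTGAAAGCAACCATGGCT"
qual = "BF@DEF@GFB::9EFDGEFFAE?D;CC=FEFFEF6@9:1C:FCFEFC:AF"

scores = phred33(qual)
print("min/max quality:", min(scores), max(scores))
print(keep_read(seq, qual))

# Hypothetical reads that trip each filter.
print(keep_read("NNNNNNACGT", "FFFFFFFFFF"))
print(keep_read("ACGTACGT", "!!!!!FFF"))
```

Under these assumptions the example read passes: its lowest base quality is 16, above the threshold of 12.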
# Besides the clean data, the run also produces 8 quality report files, which is quite thoughtful
Base_distributions_by_read_position_1.txt
Base_distributions_by_read_position_2.txt
Base_quality_value_distribution_by_read_position_1.txt
Base_quality_value_distribution_by_read_position_2.txt
Basic_Statistics_of_Sequencing_Quality.txt
Distribution_of_Q20_Q30_bases_by_read_position_1.txt
Distribution_of_Q20_Q30_bases_by_read_position_2.txt
Statistics_of_Filtered_Reads.txt
A look at the 8 quality report files
Base_distributions_by_read_position_1.txt
Base_distributions_by_read_position_2.txt
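These two files store, for fq1 and fq2 respectively, the percentage of each of the four bases A, T, G, and C (N is reported as well) at every read position before and after filtering, aggregated over all reads.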
Pos A C G T N Clean A Clean C Clean G Clean T Clean N
1 21.60% 37.75% 25.16% 15.48% 0.01% 21.60% 37.75% 25.16% 15.48% 0.01%
2 19.07% 12.22% 20.45% 48.25% 0.01% 19.07% 12.22% 20.45% 48.25% 0.00%
3 24.01% 22.27% 22.08% 31.63% 0.00% 24.02% 22.27% 22.08% 31.63% 0.00%
4 37.14% 19.32% 19.02% 24.51% 0.01% 37.14% 19.32% 19.02% 24.51% 0.00%
5 34.61% 18.92% 21.84% 24.63% 0.00% 34.61% 18.92% 21.84% 24.63% 0.00%
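For readers who want to see how such a per-position table is built, here is a small Python sketch that computes the same kind of statistic. The three short reads are hypothetical and the function name is invented; this only illustrates the calculation, not how SOAPnuke implements it.

```python
from collections import Counter


def base_composition(reads):
    """Return one {base: percent} dict per read position."""
    length = max(len(r) for r in reads)
    table = []
    for pos in range(length):
        # Count the base seen at this position across all reads long enough.
        counts = Counter(r[pos] for r in reads if pos < len(r))
        total = sum(counts.values())
        table.append({b: 100.0 * counts[b] / total for b in "ACGTN"})
    return table


reads = ["ACGT", "ACGA", "TCGN"]  # hypothetical reads, not from the data set
table = base_composition(reads)
for pos, row in enumerate(table, start=1):
    print(pos, " ".join(f"{b}={row[b]:.2f}%" for b in "ACGTN"))
```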
Base_quality_value_distribution_by_read_position_1.txt
Base_quality_value_distribution_by_read_position_2.txt
The distribution of quality values at each read position in fq1 and fq2, before and after filtering.
Pos Q0 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33 Q34 Q35 Q36 Q37 Q38 Q39 Q40 Q41 Mean Median Lower quartile Upper quartile 10thpercentile 90thpercentile
1 6506 0 0 0 5634 8034 10520 10780 14115 14977 25066 30361 30236 57415 62044 77249 133447 113520 147473 227156 354561 431683 418184 624258 607520 709456 811158 845320 798457 1107895 1135951 1248682 1316381 2005659 2227114 3966695 9120155 39936510 1676084 0 0 0 34.86 37.00 35.00 37.00 29.00 37.00
...
Clean Quality Value Distribute
Pos Q0 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33 Q34 Q35 Q36 Q37 Q38 Q39 Q40 Q41 Mean Median Lower quartile Upper quartile 10thpercentile 90thpercentile
1 4825 0 0 0 5532 8013 10489 10738 14075 14905 24958 30242 30078 57214 61856 77000 133019 113168 146928 226490 353320 430205 416898 622189 605782 707357 808742 843046 796149 1104558 1132645 1244984 1312412 1999461 2220267 3955090 9096624 39860838 1673117 0 0 0 34.86 37.00 35.00 37.00 29.00 37.00
...
Basic_Statistics_of_Sequencing_Quality.txt
Basic statistics of sequencing quality, again comparing fq1 and fq2 before and after filtering. The items are:
Item
Read length
Total number of reads
Number of filtered reads (%)
Total number of bases
Number of filtered bases (%)
Reads related to Adapter and Trimmed (%)
Number of base A (%)
Number of base C (%)
Number of base G (%)
Number of base T (%)
Number of base N (%)
Number of base calls with quality value of 20 or higher (Q20+) (%)
Number of base calls with quality value of 30 or higher (Q30+) (%)
Distribution_of_Q20_Q30_bases_by_read_position_1.txt
Distribution_of_Q20_Q30_bases_by_read_position_2.txt
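For fq1 and fq2 respectively, the percentage of bases at each read position with quality Q20 or higher and Q30 or higher, before and after filtering.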
Position in reads Percentage of Q20+ bases Percentage of Q30+ bases Percentage of Clean Q20+ Percentage of Clean Q30+
1 98.61% 89.07% 98.62% 89.08%
2 97.95% 87.35% 97.95% 87.36%
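These percentages can be cross-checked against the quality histogram. The Python sketch below takes the raw (pre-filter) Q0..Q41 counts for position 1 from the Base_quality_value_distribution_by_read_position_1.txt row shown earlier and recomputes the share of bases at Q20+ and Q30+. The function name pct_at_least is made up for this example.

```python
# Counts for Q0..Q41 at position 1 of the raw fq1 reads, copied from the
# Base_quality_value_distribution_by_read_position_1.txt row shown earlier.
ROW = (
    "6506 0 0 0 5634 8034 10520 10780 14115 14977 25066 30361 30236 57415 "
    "62044 77249 133447 113520 147473 227156 354561 431683 418184 624258 "
    "607520 709456 811158 845320 798457 1107895 1135951 1248682 1316381 "
    "2005659 2227114 3966695 9120155 39936510 1676084 0 0 0"
)
counts = [int(x) for x in ROW.split()]


def pct_at_least(counts, threshold):
    """Percentage of bases whose quality is >= threshold."""
    return 100.0 * sum(counts[threshold:]) / sum(counts)


q20 = pct_at_least(counts, 20)
q30 = pct_at_least(counts, 30)
print(f"Q20+ {q20:.2f}%  Q30+ {q30:.2f}%")
```

The result agrees with the 98.61% and 89.07% reported for position 1, so the two report files are consistent with each other.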
Statistics_of_Filtered_Reads.txt
Holds information about the reads that were filtered out, so you can see clearly why each group was removed.
Item Total
Total filtered reads (%) 326084
Reads with adapter (%) 6016
Reads with low quality (%) 86372
Reads with low mean quality (%) 0
Reads with duplications (%) 0
Read with n rate exceed: (%) 233696
Read with small insert size: (%) 0
Reads with PolyA (%) 0 ......
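The report lists only counts, so a few lines of Python help to see which filter dominates. The sketch below uses the numbers from the excerpt above; rows hidden behind the "......" are not included, and the short reason labels are my own.

```python
# Counts copied from the Statistics_of_Filtered_Reads.txt excerpt above;
# rows elided by "......" in the excerpt are not included.
total_filtered = 326084
reasons = {
    "adapter": 6016,
    "low quality": 86372,
    "low mean quality": 0,
    "duplications": 0,
    "N rate exceeded": 233696,
    "small insert size": 0,
    "PolyA": 0,
}

# Share of each reason among all filtered reads, largest first.
shares = {name: 100.0 * n / total_filtered for name, n in reasons.items()}
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {reasons[name]:>7} {pct:6.2f}%")

print("sum of listed reasons:", sum(reasons.values()))
```

In this run the N-rate filter accounts for roughly 72% of the filtered reads, low quality for about 26%, and adapters for about 2%; the listed counts add up exactly to the reported total.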
Personally, I find Basic_Statistics_of_Sequencing_Quality.txt and Statistics_of_Filtered_Reads.txt the most important.