GWAS走起~(2 fastqc、fastp)

Step four:


? ? 對從公司送回來的原始數(shù)據(jù)raw data預處理,至于問什么要進行這一步,通俗點講就像是炒菜,直接從地里拔出來的青菜是不能直接用的,而正規(guī)的程序,至少要洗洗切切;而洗菜得用盆,切菜得用刀,所以要處理這些raw data也得要找點工具:首先得有個切菜的桌子放菜,而我們要放數(shù)據(jù)而且是大數(shù)據(jù),一般的電腦是肯定不行的,最好是臺服務器,而且要臺擴大內(nèi)存的服務器,后期我會講一講自己理解的該怎那么配一臺能做生信的電腦;其次,就是菜刀了,刀的種類有很多,這里的工具(軟件)也是多種多樣:fastqc、fastp、bwa、samtools、vcftools、gatk、picard、plink~~~~~這里的是我了解的一小部分,據(jù)統(tǒng)計這樣的工具現(xiàn)在有數(shù)千種,但是步驟就這些,它們的功能也都是大同小異,同一功能軟件的差異我以為是側重點有所不同,這個需要細品,但是大眾化的也就我列舉的這幾個,反正對于我這樣的小白白是夠了,我也不想去細品了。

我認為那到菜的第一件事就是看看這菜新不新鮮、能不能吃,數(shù)據(jù)也是這樣,看看這個數(shù)據(jù)合不合格、能不能用。下面就要用到我們的第一把刀了——fastqc,用它看一下“這菜”怎么樣?

哎幺幺!忘了和大家說一個前提,做生信的大多數(shù)軟件最好用Linux系統(tǒng),其實這個系統(tǒng)也很簡單,就和你用慣了Android,現(xiàn)在得用iOS了,多少你的適應一下。其實啊,這里用到的命令也就幾條,想了解,后面我也和你說一下。這其實都不難,如果非要說點難點,我覺著就是得了解Linux下每個軟件參數(shù)怎么用。

噢噢噢噢噢!


沒別的意思就是想提提提神,下面就讓我們一起了解一下fastqc。

Fastqc是一款基于JAVA的軟件,它可以很快的對raw

data 進行數(shù)據(jù)質量評估,也就是“看菜”,讀書人叫他“QC”(quality control),至于問什么這么叫我覺著它就是想讓像我們這樣的小白白第一次看的時候看不懂。

用fastqc可以了解以下信息


測序數(shù)據(jù)的基本信息:也就是basic statistics

每個堿基的質量值:只要大多數(shù)數(shù)據(jù)》20就差不多

每條reads序列的質量值:橫軸是quality,縱軸是reads的數(shù)目,可以大體上了解那一部分質量比較差

每條序列的ATCG組成:A=T,G=C,滿足這個堿基互補配對原則

每條序列N的含量:一般不出現(xiàn)什么東西,有東西出現(xiàn)就是有問題了

GC含量分布:一般是兩條線,一條藍的事理論值,一條紅的不均勻的線,一般跟小狗啃的樣,崎嶇不平,但是模樣應該和理論值差不多,如果偏差太大就是樣本應該是被污染了

每條序列的長度分布:根據(jù)圖可以大體了解哪一塊長度的reads比較多

序列中duplication程度:重復序列是正常的,在提dna時,樣品本來就是有重復的序列,因為,我們不是提的單細胞基因,還有在測序是要進行橋式pcr擴增,用以各堿基擴大信號。但是如果duplication過大就說明了有bias的存在。

其實,還有許多信息,但是,我們這里了解這些就夠了。而且如果有問題fastqc會用叉號提示你的。

知道了fastqc可以做什么,能得到什么,我們下面就了解一下怎么用它吧!

首先下載它,安裝它,然后用它(這部分我補上的)。在我看來Linux的軟件的玩法都差不多,知道它的大體用法,了解一下它的參數(shù)基本就可以操作了。好!盤它!

首先敲一下fastqc –h看裝好了嗎!

FastQC - A high throughputsequence QC analysis tool

?

SYNOPSIS

?

???????? fastqc seqfile1 seqfile2 .. seqfileN

?

??? fastqc [-o output dir] [--(no)extract] [-ffastq|bam|sam]

?????????? [-c contaminant file] seqfile1 ..seqfileN

?

DESCRIPTION

?

??? FastQC reads a set of sequence files andproduces from each one a quality

??? control report consisting of a number ofdifferent modules, each one of

??? which will help to identify a differentpotential type of problem in your

??? data.

???

??? If no files to process are specified on thecommand line then the program

??? will start as an interactive graphicalapplication.? If files are provided

??? on the command line then the program willrun with no user interaction

??? required.?In this mode it is suitable for inclusion into a standardised

??? analysis pipeline.

???

??? The options for the program as as follows:

???

??? -h --help?????? Print this help file and exit

???

??? -v --version??? Print the version of the program and exit

???

??? -o --outdir???? Create all output files in the specifiedoutput directory.

??????????????????? Please note that thisdirectory must exist as the program

??????????????????? will not create it.? If this option is not set then the

??????????????????? output file for eachsequence file is created in the same

??????????????????? directory as the sequencefile which was processed.

???????????????????

??? --casava??????? Files come from raw casava output.Files in the same sample

??????????????????? group (differing only bythe group number) will be analysed

??????????????????? as a set rather thanindividually. Sequences with the filter

??????????????????? flag set in the header willbe excluded from the analysis.

??????????????? ????Files must have the same names given tothem by casava

??????????????????? (including being gzippedand ending with .gz) otherwise they

??????????????????? won't be grouped togethercorrectly.

???????????????????

??? --nano????????? Files come from nanopore sequencesand are in fast5 format. In

??????????????????? this mode you can pass indirectories to process and the program

??????????????????? will take in all fast5files within those directories and produce

??????????????????? a single output file fromthe sequences found in all files.???????????????????

???????????????????

??? --nofilter????? If running with --casava then don'tremove read flagged by

??????????????????? casava as poor quality whenperforming the QC analysis.

??????????????????

??? --extract?????? If set then the zipped output file willbe uncompressed in

??????????????????? the same directory after ithas been created.? By default

??????????????????? this option will be set iffastqc is run in non-interactive

?????????????????? ?mode.

???????????????????

??? -j --java?????? Provides the full path to the javabinary you want to use to

??????????????????? launch fastqc. If notsupplied then java is assumed to be in

??????????????????? your path.

??????????????????

??? --noextract???? Do not uncompress the output file aftercreating it.? You

??????????????????? should set this option ifyou do not wish to uncompress

??????????????????? the output when running innon-interactive mode.

???????????????????

??? --nogroup?????? Disable grouping of bases for reads>50bp. All reports will

??????????????????? show data for every base inthe read.? WARNING: Using this

??????????????????? option will cause fastqc tocrash and burn if you use it on

???????????????? ???really long reads, and your plots may end upa ridiculous size.

??????????????????? You have been warned!

???????????????????

??? --min_length??? Sets an artificial lower limit on thelength of the sequence

??????????????????? to be shown in the report.? As long as you set this to a value

??????????????????? greater or equal to yourlongest read length then this will be

??????????????????? the sequence length used tocreate your read groups.? This can

??????????????????? be useful for makingdirectly comaparable statistics from

??????????????????? datasets with somewhatvariable read lengths.

???????????????????

??? -f --format???? Bypasses the normal sequence file formatdetection and

??????????????????? forces the program to usethe specified format.? Valid

??????????????????? formats arebam,sam,bam_mapped,sam_mapped and fastq

???????????????????

??? -t --threads??? Specifies the number of files which can beprocessed

??????????????????? simultaneously.? Each thread will be allocated 250MB of

????? ??????????????memory so you shouldn't run morethreads than your

??????????????????? available memory will copewith, and not more than

??????????????????? 6 threads on a 32 bitmachine

?????????????????

??? -c????????????? Specifies a non-default file whichcontains the list of

??? --contaminants? contaminants to screen overrepresentedsequences against.

??????????????????? The file must contain setsof named contaminants in the

??????????????????? formname[tab]sequence.? Lines prefixed with ahash will

??????????????????? be ignored.

?

??? -a????????????? Specifies a non-default filewhich contains the list of

??? --adapters????? adapter sequences which will be explicitysearched against

??????????????????? the library. The file mustcontain sets of named adapters

??????????????????? in the formname[tab]sequence.? Lines prefixed with ahash

??????????????????? will be ignored.

???????????????????

??? -l????????????? Specifies a non-default filewhich contains a set of criteria

??? --limits??????? which will be used to determine thewarn/error limits for the

??????????????????? various modules.? This file can also be used to selectively

??????????????????? remove some modules fromthe output all together.? The format

??????????????????? needs to mirror the defaultlimits.txt file found in the

??????????????????? Configuration folder.

???????????????????

?? -k --kmers?????? Specifies the length of Kmer to look forin the Kmer content

???? ???????????????module. Specified Kmer lengthmust be between 2 and 10. Default

??????????????????? length is 7 if notspecified.

???????????????????

?? -q --quiet?????? Supress all progress messages on stdoutand only report errors.

??

?? -d --dir?????? ??Selects a directory to be used for temporaryfiles written when

??????????????????? generating report images.Defaults to system temp directory if

??????????????????? not specified.

???????????????????

BUGS

?

??? Any bugs in fastqc should be reportedeither to simon.andrews@babraham.ac.uk

??? or inwww.bioinformatics.babraham.ac.uk/bugzilla/

??? 然后,你如果得到這些內(nèi)容,它基本就是好了!如果other,就是有問題,自己查一下解決一下,不過也可以私信,一起探討一下。

fastq-o ./ ../reads/example1.*

-o:用來指定輸出文件的所在目錄,注意是不能自動新建目錄的。

-f:用來強制指定輸入文件格式,默認會自動檢測。

-c:用來指定一個contaminant文件,fastqc會把overrepresented sequences往這個contaminant文件里搜索。

contaminant文件的格式是"Name\tSequences",#開頭的行是注釋。

加上 -q 會進入沉默模式,(這個就沒什么必要)

在我看來了解-o 、-f就可以操作了

好!找個文件,嘗試一下

?[hai@localhost~]$ cd ~/proj1/fastqc/

[hai@localhost fastqc]$ fastqc -f fastq -o ./../reads/example1.*

Started analysis of example1.L.fq

Approx 5% complete for example1.L.fq

Approx 10% complete for example1.L.fq

Approx 15% complete for example1.L.fq

Approx 20% complete for example1.L.fq

Approx 25% complete for example1.L.fq

Approx 30% complete for example1.L.fq

Approx 35% complete for example1.L.fq

Approx 40% complete for example1.L.fq

Approx 45% complete for example1.L.fq

Approx 50% complete for example1.L.fq

Approx 55% complete for example1.L.fq

Approx 60% complete for example1.L.fq

Approx 65% complete for example1.L.fq

Approx 70% complete for example1.L.fq

Approx 75% complete for example1.L.fq

Approx 80% complete for example1.L.fq

Approx 85% complete for example1.L.fq

Approx 90% complete for example1.L.fq

Approx 95% complete for example1.L.fq

Analysis complete for example1.L.fq

Started analysis of example1.R.fq

Approx 5% complete for example1.R.fq

Approx 10% complete for example1.R.fq

Approx 15% complete for example1.R.fq

Approx 20% complete for example1.R.fq

Approx 25% complete for example1.R.fq

Approx 30% complete for example1.R.fq

Approx 35% complete for example1.R.fq

Approx 40% complete for example1.R.fq

Approx 45% complete for example1.R.fq

Approx 50% complete for example1.R.fq

Approx 55% complete for example1.R.fq

Approx 60% complete for example1.R.fq

Approx 65% complete for example1.R.fq

Approx 70% complete for example1.R.fq

Approx 75% complete for example1.R.fq

Approx 80% complete for example1.R.fq

Approx 85% complete for example1.R.fq

Approx 90% complete for example1.R.fq

Approx 95% complete for example1.R.fq

Analysis complete for example1.R.fq

歐了,點開文件查看一下!


?????

Fastqc基本就是這些,看那個紅叉叉就是有問題的,有問題,下面就要解決問題了。

如果再跟做飯聯(lián)系起來,fastqc的作用就相當于看看這剛拔出來的菜什么地方臟,下面就是要洗洗切切了,下一個工具是fastp,說到這個工具,我不禁感嘆世界是多么美好啊。

Fastpfastp開發(fā)為具有有用的質量控制和數(shù)據(jù)過濾功能的超快速FASTQ預處理器。只需掃描FASTQ數(shù)據(jù),它即可執(zhí)行質量控制,適配器修整,質量過濾,每次讀取質量修剪和許多其他操作。該工具是用C ++開發(fā)的,具有多線程支持。根據(jù)我們的評估,fastp比其他FASTQ預處理工具(如Trimmomatic或Cutadapt)快2至5倍,盡管執(zhí)行的操作要比類似工具多得多。

具體優(yōu)勢:

對數(shù)據(jù)自動進行全方位質控,生成人性化的報告。過濾功能(低質量,太短,太多N……)。對每一個序列的頭部或尾部,計算滑動窗內(nèi)的質量均值,并將均值較低的子序列進行切除(類似Trimmomatic的做法,但是快非常多)。全局剪裁 (在頭/尾部,不影響去重),對于Illumina下機數(shù)據(jù)往往最后一到兩個cycle需要這樣處理。去除接頭污染。厲害的是,你不用輸入接頭序列,因為算法會自動識別接頭序列并進行剪裁。對于雙端測序(PE)的數(shù)據(jù),軟件會自動查找每一對read的重疊區(qū)域,并對該重疊區(qū)域中不匹配的堿基對進行校正。去除尾部的polyG。對于Illumina

NextSeq/NovaSeq的測序數(shù)據(jù),因為是兩色法發(fā)光,polyG是常有的事,所以該特性對該兩類測序平臺默認打開。對于PE數(shù)據(jù)中的overlap區(qū)間中不一致的堿基對,依據(jù)質量值進行校正可以對帶分子標簽(UMI)的數(shù)據(jù)進行預處理,不管UMI在插入片段還是在index上,都可以輕松處理??梢詫⑤敵鲞M行分拆,而且支持兩種模式,分別是指定分拆的個數(shù),或者分拆后每個文件的行數(shù)。

好了,說了這么多開盤吧!

參數(shù):

-i, --in1 R1文件輸入;


-I, --in2 R2文件輸入;


-o, --out1 R1文件處理后的輸出;


-O, --out2 R2文件處理后的輸出;


-h, --html 設置輸出html格式的質控結果文件名,不設置則默認html文件名為fastp.html


-j, --json 設置輸出html格式的質控結果文件名,不設置則默認json文件名為fastp.json

其實它還有一些參數(shù),這個根據(jù)需要自己查一下稍微改一下就好了,對于小白白來說,一切都是最好的安排,就一切都默認吧!

SE:

fastp -i in.fq -o out.fq

PE:

fastp -i in.R1.fq -o out.R1.fq -I in.R2.fq -O out.R2.fq

#注意大小寫

fastp對于輸入和輸出都支持gzip壓縮,只要文件名的末尾帶有.gz,就會被認為是gzip壓縮文件,會啟用gzip對輸入輸出進行壓縮和解壓處理。

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz

除此之外,fastp還會給出一個人性化的報告。為啥人性化自己看吧。

經(jīng)過處理后,再用fastqc看一下是否都綠了.

Fstqc –f fastq –o ./ ../data/reads/example.*

哈哈哈,全綠了,紅叉叉也沒有了。

Ok,下一步,序列比對!

?著作權歸作者所有,轉載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

相關閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容