教程在:腫瘤外顯子數(shù)據(jù)處理系列教程(二)質(zhì)控與去接頭 (qq.com)
去接頭
1.安裝TrimGalore
curl https://codeload.github.com/FelixKrueger/TrimGalore/zip/0.6.1 TrimGalore.zip
unzip TrimGalore.zip
ln -s ~/tools/TrimGalore-0.6.1/trim_galore ~/tools/bin/trim_galore
在查找這個(gè)軟件的時(shí)候發(fā)現(xiàn)了一個(gè)博主介紹的很好,此處復(fù)制一點(diǎn)自己覺(jué)得有用的
Trim Galore是對(duì)FastQC和Cutadapt的包裝。適用于所有高通量測(cè)序,包括RRBS(Reduced Representation Bisulfite-Seq ), Illumina、Nextera 和smallRNA測(cè)序平臺(tái)的雙端和單端數(shù)據(jù)。主要功能包括兩步:
第一步首先去除低質(zhì)量堿基,然后去除3' 末端的adapter, 如果沒(méi)有指定具體的adapter,程序會(huì)自動(dòng)檢測(cè)前1million的序列,然后對(duì)比前12-13bp的序列是否符合以下類型的adapter:
- Illumina: AGATCGGAAGAGC
- Small RNA: TGGAATTCTCGG
- Nextera: CTGTCTCTTATA
但是這個(gè)作者并沒(méi)有用conda安裝成功
(wes) 16:45:33 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
conda install trim-galore
Collecting package metadata (current_repodata.json): -
fontconfig conda-forge/linux-64::fontconfig-2.14.0-h8e229c2_0
freetype conda-forge/linux-64::freetype-2.10.4-h0708190_1
libnsl conda-forge/linux-64::libnsl-2.0.0-h7f98852_0
libpng conda-forge/linux-64::libpng-1.6.37-h21135ba_2
libuuid conda-forge/linux-64::libuuid-2.32.1-h7f98852_1000
openjdk conda-forge/linux-64::openjdk-8.0.312-h7f98852_0
perl conda-forge/linux-64::perl-5.32.1-2_h7f98852_perl5
pigz conda-forge/linux-64::pigz-2.6-h27826a3_0
trim-galore bioconda/noarch::trim-galore-0.6.7-hdfd78af_0
xopen bioconda/noarch::xopen-0.7.3-py_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
pigz-2.6 | 87 KB | ##################################### | 100%
fontconfig-2.14.0 | 305 KB | ##################################### | 100%
xopen-0.7.3 | 11 KB | ##################################### | 100%
fastqc-0.11.9 | 9.7 MB | ##################################### | 100%
perl-5.32.1 | 14.4 MB | ##################################### | 100%
freetype-2.10.4 | 890 KB | ##################################### | 100%
libnsl-2.0.0 | 31 KB | ##################################### | 100%
libpng-1.6.37 | 306 KB | ##################################### | 100%
cutadapt-1.18 | 199 KB | ##################################### | 100%
expat-2.4.8 | 187 KB | ##################################### | 100%
openjdk-8.0.312 | 97.6 MB | ##################################### | 100%
bz2file-0.98 | 9 KB | ##################################### | 100%
libuuid-2.32.1 | 28 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
可以看到依賴的環(huán)境很多,要一個(gè)一個(gè)下載
(wes) 17:26:08 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
trim_galore --help
···
within the other read).
NOTE: If you are planning to use Bowtie2, BWA etc. you don't need to specify this option.
--retain_unpaired If only one of the two paired-end reads became too short, the longer
read will be written to either '.unpaired_1.fq' or '.unpaired_2.fq'
output files. The length cutoff for unpaired single-end reads is
governed by the parameters -r1/--length_1 and -r2/--length_2. Default: OFF.
-r1/--length_1 <INT> Unpaired single-end read length cutoff needed for read 1 to be written to
'.unpaired_1.fq' output file. These reads may be mapped in single-end mode.
Default: 35 bp.
-r2/--length_2 <INT> Unpaired single-end read length cutoff needed for read 2 to be written to
'.unpaired_2.fq' output file. These reads may be mapped in single-end mode.
Default: 35 bp.
Last modified on 07 October 2020.
安裝成功
(wes) 17:26:08 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cutadapt --help
cutadapt version 1.18
Copyright (C) 2010-2018 Marcel Martin <marcel.martin@scilifelab.se>
cutadapt removes adapter sequences from high-throughput sequencing reads.
Usage:
cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
For paired-end reads:
cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.
Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.
Citation:
Marcel Martin. Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
http://dx.doi.org/10.14806/ej.17.1.200
Run "cutadapt --help" to see all command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
--debug Print debugging information.
···
但是這個(gè)版本不得行
(wes) 17:28:33 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cutadapt --version
1.18
作者那時(shí)候2019年都2.6了,現(xiàn)在會(huì)更新
更新一下
(wes) 17:29:28 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
pip3 install --user --upgrade cutadapt
Looking in indexes: http://pypi.douban.com/simple
更新失敗了,算了算了,我的pip總感覺(jué)這個(gè)環(huán)境的有點(diǎn)問(wèn)題
term_galore運(yùn)行的參數(shù)

2.去接頭
這是教程的代碼:
cd 1.raw_fq
touch ../3.clean/trimgalore.log
## trim_galore.sh
cat ../tmp | while read id; do
fq1=${id}_1.fastq.gz
fq2=${id}_2.fastq.gz
trim_galore --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../3.clean $fq1 $fq2 >> trimgalore.log 2>&1
done
nohup bash trim_galore.sh &
multiqc ./ -n trim_galore_report -p -i " Trim REPORT OF SRP070662" -o multiqc
nohup fastqc --outdir ../2.qc/post --threads 16 *.fq.gz > ../2.qc/post/fastqc.log 2>&1
multiqc ./ -n post_qc_report -p -i " QC REPORT OF SRP070662" -o multiqc
這是教程的,自己改腳本
cat trim_galore.sh
cd /data1/jiarongf/learning/cancer-WES/0.sra/raw_data
## trim_golore.sh
cat ../../data/case | while read id; do
fq1=${id}_1.fastq.gz
fq2=${id}_2.fastq.gz
trim_galore --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../../3.clean $fq1 $fq2 > ../../log/trimgalore.log 2>&1
done
這里報(bào)錯(cuò)了,試一下跑一個(gè),發(fā)現(xiàn)試python的問(wèn)題,之前創(chuàng)建的wes環(huán)境是2的版本,所以跑完之后會(huì)報(bào)錯(cuò)
Writing final adapter and quality trimmed output to case3_techrep_2_WES_1_trimmed.fq.gz
>>> Now performing quality (cutoff '-q 28') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file case3_techrep_2_WES_1.fastq.gz <<<
ERROR: Running in parallel is not supported on Python 2
Cutadapt terminated with exit signal: '256'.
Terminating Trim Galore run, please check error message(s) to get an idea what went wrong...
現(xiàn)在更改已經(jīng)創(chuàng)建的環(huán)境的python版本

conda install python==版本號(hào)
但是有一個(gè)奇怪的就是再這個(gè)環(huán)境里面的conda show config是python是3.8.8的,為什么默認(rèn)的python的版本就是2的,哎可能是創(chuàng)建環(huán)境的問(wèn)題吧
網(wǎng)有點(diǎn)不好,慢慢等待

Proceed ([y]/n)? y
Downloading and Extracting Packages
cutadapt-4.0 | 208 KB | ##################################### | 100%
pbzip2-1.1.13 | 114 KB | ##################################### | 100%
certifi-2021.10.8 | 145 KB | ##################################### | 100%
dnaio-0.8.1 | 76 KB | ##################################### | 100%
python-isal-0.11.1 | 141 KB | ##################################### | 100%
setuptools-62.1.0 | 1.2 MB | ##################################### | 100%
xopen-1.5.0 | 25 KB | ##################################### | 100%
isa-l-2.30.0 | 192 KB | ##################################### | 100%
python-3.8.8 | 26.1 MB | ##################################### | 100%
pip-22.0.4 | 1.5 MB | ##################################### | 100%
python_abi-3.8 | 4 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
下載成功
并且也更新了cutadapt
(wes) 14:46:05 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
cutadapt -v
This is cutadapt 4.0 with Python 3.8.8
Command line parameters: -v
Run "cutadapt --help" to see command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.
cutadapt: error: unrecognized arguments: -v
(wes) 14:46:11 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
python --version
Python 3.8.8
再試著運(yùn)行一下
(wes) 14:47:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
trim_galore --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../../3.clean case3_techrep_2_WES_1.fastq.gz case3_techrep_2_WES_2.fastq.gz
Number of cores used for trimming: 8
Quality Phred score cutoff: 28
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 3 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 30 bp
Output file(s) will be GZIP compressed
Cutadapt seems to be fairly up-to-date (version 4.0). Setting -j 8
Writing final adapter and quality trimmed output to case3_techrep_2_WES_1_trimmed.fq.gz
>>> Now performing quality (cutoff '-q 28') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file case3_techrep_2_WES_1.fastq.gz <<<
10000000 sequences processed
^C
(wes) 14:47:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
全部拿去跑試試
這里提醒,每次拿去跑的時(shí)候都要?jiǎng)h掉之前的,不然不知道新生成的是哪個(gè)
(wes) 14:50:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
nohup sh trim_galore.sh > trim_galore.sh.log 2>&1 &
(wes) 14:50:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
top

好了再后臺(tái)跑起來(lái)了
去看看log有沒(méi)有問(wèn)題,這里我生成了兩個(gè)log,一個(gè)是運(yùn)行sh的log一個(gè)是運(yùn)行trim_galore的log
先去檢查sh的log

嗯呢果然很簡(jiǎn)潔,沒(méi)啥
在檢查另一個(gè)log

成功的標(biāo)志啊,哈哈哈哈哈,下面有再檢查每一個(gè)adapter了
也要跑很久,趁著跑的這段時(shí)間,去學(xué)習(xí)一下term_galore 發(fā)現(xiàn)這里不同的測(cè)序平臺(tái)會(huì)有不同的adaper,那么如何知道這批數(shù)據(jù)是什么平臺(tái)呢,生成fq文件之后不是做了一次fastqc質(zhì)控嘛,隨意打開(kāi)一個(gè)

就可以看到這個(gè)樣本的encoding了,
學(xué)習(xí)trim_galore
教程:使用trim_galore對(duì)數(shù)據(jù)進(jìn)行質(zhì)量控制-過(guò)濾 - 簡(jiǎn)書 (jianshu.com)


參數(shù)說(shuō)明
--quality/-q<INT>:設(shè)定Phred quality score閾值,默認(rèn)為Phred 20 切除質(zhì)量得分低于設(shè)定值的序列
--phred33/64:使用ASCII+33/64質(zhì)量得分作為Phred得分選擇-phred33或者-phred64,表示-測(cè)序平臺(tái)使用的Phred quality score。
(需要確認(rèn):anger/Illumina 1.9+ encoding為33; Illumina 1.5 encoding為64)
--adapter/-a :輸入adapter序列。也可以不輸入,Trim Galore!會(huì)自動(dòng)尋找可能性最高的平臺(tái)對(duì)應(yīng)的adapter。自動(dòng)搜選的平臺(tái)三個(gè),也直接顯式輸入這三種平臺(tái),即--illumina、--nextera和--small_rna。
-s/--stringency<INT>:設(shè)定可以忍受的前后adapter重疊的堿基數(shù),默認(rèn)為1(非??量蹋?梢赃m度放寬,因?yàn)楹笠粋€(gè)adapter幾乎不可能被測(cè)序儀讀到。
--length<INT>:設(shè)定輸出reads長(zhǎng)度閾值小于設(shè)定值會(huì)被拋棄,默認(rèn)值20bp;即小于20bp的被去除。注意,在pe150下,不要設(shè)置太高,可以50或36(默認(rèn)20)
--max_length : 設(shè)置長(zhǎng)度大于此值被丟棄
-e <ERROR RATE> :默認(rèn)0.1
--paired:對(duì)于雙端測(cè)序結(jié)果,一對(duì)reads中,如果有一個(gè)被剔除,那么另一個(gè)會(huì)被同樣拋棄,而不管是否達(dá)到標(biāo)準(zhǔn)。
--retain_unpaired:對(duì)于雙端測(cè)序結(jié)果,一對(duì)reads中,如果一個(gè)read達(dá)到標(biāo)準(zhǔn),但是對(duì)應(yīng)的另一個(gè)要被拋棄,達(dá)到標(biāo)準(zhǔn)的read會(huì)被單獨(dú)保存為一個(gè)文件。
--gzip和--dont_gzip:清洗后的數(shù)據(jù)zip打包或者不打包。
-o/--output_dir:輸入目錄 [需要提前建立目錄,否則運(yùn)行會(huì)報(bào)錯(cuò)]。
-- trim-n : 移除read一端的reads
-j :使用線程數(shù), 注意假設(shè)已使用Python 3并安裝了Pigz,那么內(nèi)核設(shè)置為4,實(shí)際使用內(nèi)核是15,因此,最高設(shè)為4.
--fastqc #當(dāng)分析結(jié)束后,使用默認(rèn)選項(xiàng)對(duì)結(jié)果文件進(jìn)行fastqc分析
fasqc的結(jié)果

Per base sequence quality : 每個(gè)read各位置堿基的測(cè)序質(zhì)量。
橫軸堿基的位置,縱軸是質(zhì)量分?jǐn)?shù),
Quality score= -10log10p
(p代表錯(cuò)誤率),所以當(dāng)質(zhì)量分?jǐn)?shù)為40的時(shí)候,p就是0.0001,質(zhì)量算高了。
[紅色線]代表中位數(shù),
[藍(lán)色線]代表平均數(shù)
[黃色]是25%-75%區(qū)間,
[權(quán)]是10%-90%區(qū)間。
若任一位置的下 [四分位數(shù)] 低于10或者 [中位數(shù)] 低于25---->出現(xiàn) “警告”
若任一位置的下 [四分位數(shù)] 低于5或者 [中位數(shù)] 低于20,出現(xiàn)“失敗,F(xiàn)ail”
跑完啦
昨天跑了一下去接頭和質(zhì)控,質(zhì)控的腳本:
(wes) 08:38:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat post_fastqc.sh
nohup fastqc --outdir ../2.qc/post --threads 16 ../3.clean/*.fq.gz > ../log/post.fastqc.log 2>&1 &

multi一下
(wes) 08:41:12 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat post_qc_multiqc.sh
multiqc /data1/jiarongf/learning/cancer-WES/2.qc/post -n post_qc_report -p -i " QC REPORT OF SRP070662" -o /data1/jiarongf/learning/cancer-WES/2.qc/post/
(wes) 08:41:21 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh post_qc_multiqc.sh
/// MultiQC ?? | v1.13.dev0
| multiqc | Report title: QC REPORT OF SRP070662
| multiqc | Search path : /data1/jiarongf/learning/cancer-WES/2.qc/post
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 120/120
| fastqc | Found 60 reports
| multiqc | Compressing plot data
| multiqc | Report : ../2.qc/post/post_qc_report.html
| multiqc | Data : ../2.qc/post/post_qc_report_data
| multiqc | Plots : ../2.qc/post/post_qc_report_plots
| multiqc | MultiQC complete
(wes) 08:42:07 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$

結(jié)果分析:


有看到dup的比例確實(shí)有減少
M seqs也有相應(yīng)的降低


看到質(zhì)量有明顯的提升









