教程在：腫瘤外顯子數(shù)據(jù)處理系列教程（二）質(zhì)控與去接頭 (qq.com)

去接頭

1.安裝TrimGalore

curl https://codeload.github.com/FelixKrueger/TrimGalore/zip/0.6.1 TrimGalore.zip
unzip TrimGalore.zip
ln -s ~/tools/TrimGalore-0.6.1/trim_galore ~/tools/bin/trim_galore

在查找這個(gè)軟件的時(shí)候發(fā)現(xiàn)了一個(gè)博主介紹的很好，此處復(fù)制一點(diǎn)自己覺(jué)得有用的

Trim Galore是對(duì)FastQC和Cutadapt的包裝。適用于所有高通量測(cè)序，包括RRBS(Reduced Representation Bisulfite-Seq ), Illumina、Nextera 和smallRNA測(cè)序平臺(tái)的雙端和單端數(shù)據(jù)。主要功能包括兩步：
第一步首先去除低質(zhì)量堿基，然后去除3' 末端的adapter, 如果沒(méi)有指定具體的adapter，程序會(huì)自動(dòng)檢測(cè)前1million的序列，然后對(duì)比前12-13bp的序列是否符合以下類型的adapter:

Illumina: AGATCGGAAGAGC
Small RNA: TGGAATTCTCGG
Nextera: CTGTCTCTTATA

但是這個(gè)作者并沒(méi)有用conda安裝成功

(wes) 16:45:33 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
conda install trim-galore
Collecting package metadata (current_repodata.json): -
 fontconfig         conda-forge/linux-64::fontconfig-2.14.0-h8e229c2_0
  freetype           conda-forge/linux-64::freetype-2.10.4-h0708190_1
  libnsl             conda-forge/linux-64::libnsl-2.0.0-h7f98852_0
  libpng             conda-forge/linux-64::libpng-1.6.37-h21135ba_2
  libuuid            conda-forge/linux-64::libuuid-2.32.1-h7f98852_1000
  openjdk            conda-forge/linux-64::openjdk-8.0.312-h7f98852_0
  perl               conda-forge/linux-64::perl-5.32.1-2_h7f98852_perl5
  pigz               conda-forge/linux-64::pigz-2.6-h27826a3_0
  trim-galore        bioconda/noarch::trim-galore-0.6.7-hdfd78af_0
  xopen              bioconda/noarch::xopen-0.7.3-py_0


Proceed ([y]/n)? y
Downloading and Extracting Packages
pigz-2.6             | 87 KB     | ##################################### | 100%
fontconfig-2.14.0    | 305 KB    | ##################################### | 100%
xopen-0.7.3          | 11 KB     | ##################################### | 100%
fastqc-0.11.9        | 9.7 MB    | ##################################### | 100%
perl-5.32.1          | 14.4 MB   | ##################################### | 100%
freetype-2.10.4      | 890 KB    | ##################################### | 100%
libnsl-2.0.0         | 31 KB     | ##################################### | 100%
libpng-1.6.37        | 306 KB    | ##################################### | 100%
cutadapt-1.18        | 199 KB    | ##################################### | 100%
expat-2.4.8          | 187 KB    | ##################################### | 100%
openjdk-8.0.312      | 97.6 MB   | ##################################### | 100%
bz2file-0.98         | 9 KB      | ##################################### | 100%
libuuid-2.32.1       | 28 KB     | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

可以看到依賴的環(huán)境很多，要一個(gè)一個(gè)下載

(wes) 17:26:08 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
trim_galore --help

···
 within the other read).
                        NOTE: If you are planning to use Bowtie2, BWA etc. you don't need to specify this option.

--retain_unpaired       If only one of the two paired-end reads became too short, the longer
                        read will be written to either '.unpaired_1.fq' or '.unpaired_2.fq'
                        output files. The length cutoff for unpaired single-end reads is
                        governed by the parameters -r1/--length_1 and -r2/--length_2. Default: OFF.

-r1/--length_1 <INT>    Unpaired single-end read length cutoff needed for read 1 to be written to
                        '.unpaired_1.fq' output file. These reads may be mapped in single-end mode.
                        Default: 35 bp.

-r2/--length_2 <INT>    Unpaired single-end read length cutoff needed for read 2 to be written to
                        '.unpaired_2.fq' output file. These reads may be mapped in single-end mode.
                        Default: 35 bp.

Last modified on 07 October 2020.

安裝成功

(wes) 17:26:08 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cutadapt --help
cutadapt version 1.18

Copyright (C) 2010-2018 Marcel Martin <marcel.martin@scilifelab.se>

cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

For paired-end reads:
    cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.

Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.

Citation:

Marcel Martin. Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
http://dx.doi.org/10.14806/ej.17.1.200

Run "cutadapt --help" to see all command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --debug               Print debugging information.
  ···

但是這個(gè)版本不得行

(wes) 17:28:33 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cutadapt --version
1.18

作者那時(shí)候2019年都2.6了，現(xiàn)在會(huì)更新
更新一下

(wes) 17:29:28 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
pip3 install --user --upgrade cutadapt
Looking in indexes: http://pypi.douban.com/simple

更新失敗了，算了算了，我的pip總感覺(jué)這個(gè)環(huán)境的有點(diǎn)問(wèn)題

term_galore運(yùn)行的參數(shù)

image.png

2.去接頭

這是教程的代碼：

cd 1.raw_fq
touch ../3.clean/trimgalore.log

## trim_galore.sh
cat ../tmp | while read id; do
    fq1=${id}_1.fastq.gz
    fq2=${id}_2.fastq.gz
    trim_galore  --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../3.clean  $fq1  $fq2 >> trimgalore.log 2>&1
done

nohup bash trim_galore.sh &

multiqc ./ -n trim_galore_report -p -i " Trim REPORT OF SRP070662" -o multiqc

nohup fastqc --outdir ../2.qc/post --threads 16 *.fq.gz > ../2.qc/post/fastqc.log 2>&1

multiqc ./ -n post_qc_report -p -i " QC REPORT OF SRP070662" -o multiqc

這是教程的，自己改腳本

cat trim_galore.sh
cd /data1/jiarongf/learning/cancer-WES/0.sra/raw_data

## trim_golore.sh
cat ../../data/case | while read id; do
    fq1=${id}_1.fastq.gz
    fq2=${id}_2.fastq.gz
    trim_galore  --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../../3.clean  $fq1  $fq2 > ../../log/trimgalore.log 2>&1
done

這里報(bào)錯(cuò)了，試一下跑一個(gè)，發(fā)現(xiàn)試python的問(wèn)題，之前創(chuàng)建的wes環(huán)境是2的版本，所以跑完之后會(huì)報(bào)錯(cuò)

Writing final adapter and quality trimmed output to case3_techrep_2_WES_1_trimmed.fq.gz


  >>> Now performing quality (cutoff '-q 28') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file case3_techrep_2_WES_1.fastq.gz <<<
ERROR: Running in parallel is not supported on Python 2


Cutadapt terminated with exit signal: '256'.
Terminating Trim Galore run, please check error message(s) to get an idea what went wrong...

現(xiàn)在更改已經(jīng)創(chuàng)建的環(huán)境的python版本

image.png

conda install python==版本號(hào)

但是有一個(gè)奇怪的就是再這個(gè)環(huán)境里面的conda show config是python是3.8.8的，為什么默認(rèn)的python的版本就是2的，哎可能是創(chuàng)建環(huán)境的問(wèn)題吧
網(wǎng)有點(diǎn)不好，慢慢等待

image.png

Proceed ([y]/n)? y


Downloading and Extracting Packages
cutadapt-4.0         | 208 KB    | ##################################### | 100%
pbzip2-1.1.13        | 114 KB    | ##################################### | 100%
certifi-2021.10.8    | 145 KB    | ##################################### | 100%
dnaio-0.8.1          | 76 KB     | ##################################### | 100%
python-isal-0.11.1   | 141 KB    | ##################################### | 100%
setuptools-62.1.0    | 1.2 MB    | ##################################### | 100%
xopen-1.5.0          | 25 KB     | ##################################### | 100%
isa-l-2.30.0         | 192 KB    | ##################################### | 100%
python-3.8.8         | 26.1 MB   | ##################################### | 100%
pip-22.0.4           | 1.5 MB    | ##################################### | 100%
python_abi-3.8       | 4 KB      | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

下載成功
并且也更新了cutadapt

(wes) 14:46:05 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
cutadapt -v
This is cutadapt 4.0 with Python 3.8.8
Command line parameters: -v
Run "cutadapt --help" to see command-line options.
See https://cutadapt.readthedocs.io/ for full documentation.

cutadapt: error: unrecognized arguments: -v
(wes) 14:46:11 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
python --version
Python 3.8.8

再試著運(yùn)行一下

(wes) 14:47:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
$
trim_galore  --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../../3.clean  case3_techrep_2_WES_1.fastq.gz  case3_techrep_2_WES_2.fastq.gz

Number of cores used for trimming: 8
Quality Phred score cutoff: 28
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 3 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 30 bp
Output file(s) will be GZIP compressed

Cutadapt seems to be fairly up-to-date (version 4.0). Setting -j 8
Writing final adapter and quality trimmed output to case3_techrep_2_WES_1_trimmed.fq.gz


  >>> Now performing quality (cutoff '-q 28') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file case3_techrep_2_WES_1.fastq.gz <<<
10000000 sequences processed
^C
(wes) 14:47:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data

全部拿去跑試試
這里提醒，每次拿去跑的時(shí)候都要?jiǎng)h掉之前的，不然不知道新生成的是哪個(gè)

(wes) 14:50:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
nohup sh trim_galore.sh > trim_galore.sh.log 2>&1 &
(wes) 14:50:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
top

image.png

好了再后臺(tái)跑起來(lái)了
去看看log有沒(méi)有問(wèn)題，這里我生成了兩個(gè)log，一個(gè)是運(yùn)行sh的log一個(gè)是運(yùn)行trim_galore的log
先去檢查sh的log

image.png

嗯呢果然很簡(jiǎn)潔，沒(méi)啥
在檢查另一個(gè)log

image.png

成功的標(biāo)志啊，哈哈哈哈哈，下面有再檢查每一個(gè)adapter了

也要跑很久，趁著跑的這段時(shí)間，去學(xué)習(xí)一下term_galore 發(fā)現(xiàn)這里不同的測(cè)序平臺(tái)會(huì)有不同的adaper，那么如何知道這批數(shù)據(jù)是什么平臺(tái)呢，生成fq文件之后不是做了一次fastqc質(zhì)控嘛，隨意打開(kāi)一個(gè)

image.png

就可以看到這個(gè)樣本的encoding了，

學(xué)習(xí)trim_galore

教程：使用trim_galore對(duì)數(shù)據(jù)進(jìn)行質(zhì)量控制-過(guò)濾 - 簡(jiǎn)書 (jianshu.com)

image.png

參數(shù)說(shuō)明

--quality/-q<INT>：設(shè)定Phred quality score閾值，默認(rèn)為Phred 20 切除質(zhì)量得分低于設(shè)定值的序列
--phred33/64：使用ASCII+33/64質(zhì)量得分作為Phred得分選擇-phred33或者-phred64，表示-測(cè)序平臺(tái)使用的Phred quality score。
（需要確認(rèn):anger/Illumina 1.9+ encoding為33; Illumina 1.5 encoding為64）
--adapter/-a ：輸入adapter序列。也可以不輸入，Trim Galore!會(huì)自動(dòng)尋找可能性最高的平臺(tái)對(duì)應(yīng)的adapter。自動(dòng)搜選的平臺(tái)三個(gè)，也直接顯式輸入這三種平臺(tái)，即--illumina、--nextera和--small_rna。
-s/--stringency<INT>：設(shè)定可以忍受的前后adapter重疊的堿基數(shù)，默認(rèn)為1（非?？量蹋？梢赃m度放寬，因?yàn)楹笠粋€(gè)adapter幾乎不可能被測(cè)序儀讀到。
--length<INT>：設(shè)定輸出reads長(zhǎng)度閾值小于設(shè)定值會(huì)被拋棄，默認(rèn)值20bp；即小于20bp的被去除。注意，在pe150下，不要設(shè)置太高，可以50或36（默認(rèn)20）
--max_length : 設(shè)置長(zhǎng)度大于此值被丟棄
-e <ERROR RATE> ：默認(rèn)0.1
--paired：對(duì)于雙端測(cè)序結(jié)果，一對(duì)reads中，如果有一個(gè)被剔除，那么另一個(gè)會(huì)被同樣拋棄，而不管是否達(dá)到標(biāo)準(zhǔn)。
--retain_unpaired：對(duì)于雙端測(cè)序結(jié)果，一對(duì)reads中，如果一個(gè)read達(dá)到標(biāo)準(zhǔn)，但是對(duì)應(yīng)的另一個(gè)要被拋棄，達(dá)到標(biāo)準(zhǔn)的read會(huì)被單獨(dú)保存為一個(gè)文件。
--gzip和--dont_gzip：清洗后的數(shù)據(jù)zip打包或者不打包。
-o/--output_dir：輸入目錄 [需要提前建立目錄，否則運(yùn)行會(huì)報(bào)錯(cuò)]。
-- trim-n : 移除read一端的reads
-j :使用線程數(shù), 注意假設(shè)已使用Python 3并安裝了Pigz，那么內(nèi)核設(shè)置為4，實(shí)際使用內(nèi)核是15，因此，最高設(shè)為4.
--fastqc #當(dāng)分析結(jié)束后，使用默認(rèn)選項(xiàng)對(duì)結(jié)果文件進(jìn)行fastqc分析

fasqc的結(jié)果

image.png

Per base sequence quality : 每個(gè)read各位置堿基的測(cè)序質(zhì)量。
橫軸堿基的位置，縱軸是質(zhì)量分?jǐn)?shù)，
Quality score= -10log10p
（p代表錯(cuò)誤率），所以當(dāng)質(zhì)量分?jǐn)?shù)為40的時(shí)候，p就是0.0001，質(zhì)量算高了。
[紅色線]代表中位數(shù)，
[藍(lán)色線]代表平均數(shù)
[黃色]是25%-75%區(qū)間，
[權(quán)]是10%-90%區(qū)間。
若任一位置的下 [四分位數(shù)] 低于10或者 [中位數(shù)] 低于25---->出現(xiàn) “警告”
若任一位置的下 [四分位數(shù)] 低于5或者 [中位數(shù)] 低于20，出現(xiàn)“失敗，F(xiàn)ail”

跑完啦

昨天跑了一下去接頭和質(zhì)控，質(zhì)控的腳本：

(wes) 08:38:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat post_fastqc.sh
nohup fastqc --outdir ../2.qc/post --threads 16 ../3.clean/*.fq.gz > ../log/post.fastqc.log 2>&1 &

image.png

multi一下

(wes) 08:41:12 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat post_qc_multiqc.sh
multiqc /data1/jiarongf/learning/cancer-WES/2.qc/post  -n post_qc_report -p -i " QC REPORT OF SRP070662" -o /data1/jiarongf/learning/cancer-WES/2.qc/post/
(wes) 08:41:21 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh post_qc_multiqc.sh

  /// MultiQC ?? | v1.13.dev0

|           multiqc | Report title:  QC REPORT OF SRP070662
|           multiqc | Search path : /data1/jiarongf/learning/cancer-WES/2.qc/post
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 120/120
|            fastqc | Found 60 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : ../2.qc/post/post_qc_report.html
|           multiqc | Data        : ../2.qc/post/post_qc_report_data
|           multiqc | Plots       : ../2.qc/post/post_qc_report_plots
|           multiqc | MultiQC complete
(wes) 08:42:07 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$