教程在:腫瘤外顯子數(shù)據(jù)處理系列教程(二)質(zhì)控與去接頭 (qq.com)
安裝軟件
這里要用到fastqc和python,先檢查自己有沒有
fastqc -h

如果沒有就要按照教程進(jìn)行安裝,如下:
## fastqc使用多線程要用--threads
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.zip
unizp fastqc_v0.11.8.zip
ln -s ~/tools/FastQC/fastqc ~/tools/bin/fastqc
mkdir -p 2.qc/{pre, post}
nohup fastqc --outdir ../2.qc/pre --threads 16 *.fastq.gz > ../2.qc/pre/fastqc.log 2>&1 &
## 需要先安裝python
## sudo apt install libffi-dev
cd
cd tools/
mkdir python
wget https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tgz
tar zxvf Python-3.7.3.tgz
cd Python-3.7.3
./configure --prefix="/home/llwu/tools/python/"
make
make install
cd ../python/bin
ls * | while read id
do
ln -s ~/tools/python/bin/${id} ~/tools/bin/${id}
done
這里的鏈接可能也是過時(shí)的,需要去找新的進(jìn)行下載
我檢查了一下我的python的版本是
(/home/jiarongf/my-envs/wes) 10:11:01 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra
$
python --version
Python 2.7.15
不知道后面是否會(huì)有影響,結(jié)果發(fā)現(xiàn)自己的安裝實(shí)在home下的環(huán)境,所以趕緊去更改了一下,具體可以看我發(fā)布的一篇更改虛擬環(huán)境位置的文章,
更改之后就python就是3.8的了
(/home/jiarongf/my-envs/wes) 10:36:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra
$
python --version
Python 3.8.8
更改了之后記得重新進(jìn)行wes的環(huán)境
source activate wes
(wes) 11:18:00 jiarongf@172.16.10.223:/data1/jiarongf
還會(huì)用到MultiQC,進(jìn)行安裝,也是直接安裝在run文件夾下
下面是安裝的腳本,可以分開運(yùn)行
curl -LOk https://github.com/ewels/MultiQC/archive/master.zip
unzip master.zip
cd MultiQC-master
python setup.py install
下載了之后才發(fā)現(xiàn)這個(gè)可以用pip或者conda安裝
Installation
You can install MultiQC from PyPI
using pip as follows:
bash
pip install multiqc
Alternatively, you can install using Conda
from the bioconda channel:
conda install -c bioconda multiqc
If you would like the development version instead, the command is:
pip install --upgrade --force-reinstall git+https://github.com/ewels/MultiQC.git
但是此處我已經(jīng)下載了,所以直接安裝就好了
python setup.py install

檢測(cè)安裝是否成功
multiqc -h

去接頭前質(zhì)控
首先需要把之前生成的fastq的文件放到一個(gè)目錄下,定義為raw_data

(wes) 15:48:37 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
vim pre_qc_report.sh
(wes) 15:49:31 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh pre_qc_report.sh
/// MultiQC ?? | v1.13.dev0
| multiqc | Report title: QC REPORT OF SRP070662
| multiqc | Search path : /data1/jiarongf/learning/cancer-WES/0.sra/raw_data
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 60/60
| multiqc | No analysis results found. Cleaning up..
| multiqc | MultiQC complete
(wes) 15:49:51 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat pre_qc_report.sh
multiqc ../0.sra/raw_data/ -n pre_qc_report -p -i " QC REPORT OF SRP070662" -o multiqc
(wes) 15:52:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh pre_qc_report.sh > ../log/pre_qc_report.log 2>&1
但是感覺沒有輸出結(jié)果
這里突然感覺應(yīng)該先fastqc,然后再去multi,找到一個(gè)好的教程就是從下載sra數(shù)據(jù),解壓為fastq,在做fastqc在multi,然后合成一個(gè)報(bào)告,還有報(bào)告結(jié)果的解讀都很不錯(cuò)
MultiQC使用 | 碼農(nóng)家園 (codenong.com)
fastqc
淺試一個(gè)
(wes) 15:58:02 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
fastqc /data1/jiarongf/learning/cancer-WES/0.sra/raw_data/case3_techrep_2_WES_2.fastq.gz
Started analysis of case3_techrep_2_WES_2.fastq.gz
Approx 5% complete for case3_techrep_2_WES_2.fastq.gz
Approx 10% complete for case3_techrep_2_WES_2.fastq.gz
Approx 15% complete for case3_techrep_2_WES_2.fastq.gz
Approx 20% complete for case3_techrep_2_WES_2.fastq.gz
Approx 25% complete for case3_techrep_2_WES_2.fastq.gz
Approx 30% complete for case3_techrep_2_WES_2.fastq.gz
Approx 35% complete for case3_techrep_2_WES_2.fastq.gz
Approx 40% complete for case3_techrep_2_WES_2.fastq.gz
Approx 45% complete for case3_techrep_2_WES_2.fastq.gz
Approx 50% complete for case3_techrep_2_WES_2.fastq.gz
Approx 55% complete for case3_techrep_2_WES_2.fastq.gz
Approx 60% complete for case3_techrep_2_WES_2.fastq.gz
Approx 65% complete for case3_techrep_2_WES_2.fastq.gz
Approx 70% complete for case3_techrep_2_WES_2.fastq.gz
Approx 75% complete for case3_techrep_2_WES_2.fastq.gz
Approx 80% complete for case3_techrep_2_WES_2.fastq.gz
Approx 85% complete for case3_techrep_2_WES_2.fastq.gz
Approx 90% complete for case3_techrep_2_WES_2.fastq.gz
Approx 95% complete for case3_techrep_2_WES_2.fastq.gz
Analysis complete for case3_techrep_2_WES_2.fastq.gz
(wes) 16:05:53 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
時(shí)間太久了,建議后臺(tái)跑,再有就是這個(gè)成功的,
跑完以后再原來的目錄生成了

這只是一個(gè)的,所以需要循環(huán)一下,run目錄下寫個(gè)腳本
前面的腳本
pre_qc_report.sh也需要改個(gè)名字
(wes) 16:08:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
mv pre_qc_report.sh pre_qc_multiqc.sh
上述做錯(cuò)了,重新搞
去接頭前質(zhì)控
(wes) 16:33:05 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES
$
cd /data1/jiarongf/learning/cancer-WES
(wes) 16:31:37 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES
$
mkdir -p 2.qc/{pre,post}
(wes) 16:33:01 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES
$
l
0.sra/ 2.qc/ data/ log/ run/
(wes) 16:33:02 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES
$
ls 2.qc/
post pre
質(zhì)控
nohup fastqc --outdir ../2.qc/pre --threads 16 *.fastq.gz > ../2.qc/pre/fastqc.log 2>&1 &
(wes) 16:35:55 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
vim pre_fastqc.sh
(wes) 16:41:22 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat pre_fastqc.sh
nohup fastqc --outdir ../2.qc/pre --threads 16 ../0.sra/raw_data/*.fastq.gz > ../log/pre.fastqc.log 2>&1 &
(wes) 16:41:09 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh pre_fastqc.sh
(wes) 16:41:30 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
直接掛到后臺(tái)去,

一共有60個(gè),

16個(gè)16個(gè)的跑,感覺也要跑很久,可以先做去接頭

跑完啦
/data1/jiarongf/learning/cancer-WES/2.qc/pre

可以multi了
10:24:45 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
vim pre_qc_multiqc.sh
10:26:37 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
cat pre_qc_multiqc.sh
multiqc /data1/jiarongf/learning/cancer-WES/2.qc/pre -n pre_qc_report -p -i " QC REPORT OF SRP070662" -o /data1/jiarongf/learning/cancer-WES/2.qc/pre/
10:26:42 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$
sh pre_qc_multiqc.sh > ../log/pre_qc_multiqc.sh.log 2>&1
10:28:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
$


看結(jié)果
1.general statistical


2.fastqc
2.1 Sequence Counts


黑色的是重復(fù)序列,藍(lán)色的是有用的序列(reads)
可以看到重復(fù)序列的比例還是小的
2.2 Sequence Quality Histograms

每一條線是一個(gè)樣本,應(yīng)該打分都在綠色區(qū)域就會(huì)好一點(diǎn)
綠色區(qū)間——質(zhì)量很好,橙色區(qū)間——質(zhì)量合理,紅色區(qū)間——質(zhì)量不好

這60個(gè)都是綠色的線,代表測(cè)序質(zhì)量挺好的
2.3 Per Sequence Quality Scores

綠色區(qū)間——質(zhì)量很好、橙色區(qū)間——質(zhì)量合理、紅色區(qū)間——質(zhì)量不好

2.4 Per base Sequence Content


結(jié)果顯示3序列都warn,如果是紅色的說明每個(gè)位置每種堿基出現(xiàn)的概率差別很大,可能有過表達(dá)序列的污染。

2.5 Per Sequence GC Content



以下是一個(gè)錯(cuò)誤的例子

這里結(jié)果顯示四條序列都被報(bào)錯(cuò),從形狀上來看曲線和正態(tài)曲線相差甚遠(yuǎn),可能是由于文庫(kù)的污染或是部分reads構(gòu)成的子集有偏差造成的
2.6 Per Base N Content 每條reads各位置N堿基含量比例

說明測(cè)序儀器能辨別這60個(gè)樣本中每條reads的每個(gè)位置的堿基
2.7 Sequence Length Distribution 序列長(zhǎng)度分布

對(duì)于這幾個(gè)樣本,每次測(cè)序儀測(cè)出來的長(zhǎng)度主要都在76bp
2.8 Sequence Duplication Levels 每個(gè)序列的相對(duì)重復(fù)水平

2.9 Overrepresented sequences 文庫(kù)中過表達(dá)序列的比例

這些序列中過表達(dá)的序列的比例都遠(yuǎn)遠(yuǎn)超過1%,如果有的序列中過表達(dá)的序列超過50%,這種情況的出現(xiàn),不是這種轉(zhuǎn)錄本巨量表達(dá),就是樣品被污染。
2.10 Adapter Content 接頭含量

這是沒有接頭的意思????不太清楚了,一會(huì)和去接頭以后的再比較以下,看看有什么差別