1. Go to the Genome Announcements website (https://mra.asm.org/), find a bacterial genome article, and note the SRA accession numbers it reports.

- The steps are the same for every strain, so we assemble only strain 4041.
2. Download the run from the SRA database with prefetch.
- Command:
prefetch SRR5513009
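The connection dropped partway through, so we could not be sure the download was complete. We compared the file size with the one listed on the official FTP site: both are about 600 MB, so the download should be complete.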
(Figure: the downloaded sequence file)
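Comparing sizes by eye only catches gross truncation. A minimal sketch of an exact byte comparison follows; `check_size` is a helper name made up for this note, and the expected byte count is an assumption that has to be copied by hand from the FTP listing.

```shell
# Compare a local file's size in bytes with the size listed on the server.
# check_size is a hypothetical helper; expected_bytes must be supplied by the user.
check_size() {
    file=$1
    expected_bytes=$2
    actual_bytes=$(wc -c < "$file" | tr -d ' ')
    if [ "$actual_bytes" -eq "$expected_bytes" ]; then
        echo "OK: $file is $actual_bytes bytes"
    else
        echo "MISMATCH: $file is $actual_bytes bytes, expected $expected_bytes"
        return 1
    fi
}

# Usage (EXPECTED_BYTES is a placeholder for the size shown on the FTP site):
# check_size ~/ncbi/public/sra/SRR5513009.sra EXPECTED_BYTES
```

The SRA Toolkit also ships `vdb-validate`, which checks the checksums stored inside the .sra file and is probably the more reliable test when it is installed.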
3. Convert the .sra file to FASTQ with fastq-dump, writing gzip-compressed output to save disk space. This takes a while, so we run it in the background.
fastq-dump --gzip --split-files ~/ncbi/public/sra/SRR5513009.sra &
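Once the background job has finished (for example after `wait` returns in the same shell), it is worth confirming that the two mate files hold the same number of reads. The sketch below assumes standard 4-line FASTQ records; `count_reads` and `same_read_count` are helper names invented for this note.

```shell
# Count reads in a gzip-compressed FASTQ file (assumes 4 lines per read).
count_reads() {
    gzip -cd "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Succeed only if both mate files contain the same number of reads.
same_read_count() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "$1: $n1 reads; $2: $n2 reads"
    [ "$n1" = "$n2" ]
}

# Usage:
# same_read_count SRR5513009_1.fastq.gz SRR5513009_2.fastq.gz
```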

4. Quality control with FastQC
wwwww77@wwwww77-VirtualBox:~/assembly$ fastqc SRR5513009_1.fastq.gz
Started analysis of SRR5513009_1.fastq.gz
Approx 5% complete for SRR5513009_1.fastq.gz
Approx 10% complete for SRR5513009_1.fastq.gz
Approx 15% complete for SRR5513009_1.fastq.gz
Approx 20% complete for SRR5513009_1.fastq.gz
Approx 25% complete for SRR5513009_1.fastq.gz
Approx 30% complete for SRR5513009_1.fastq.gz
Approx 35% complete for SRR5513009_1.fastq.gz
Approx 40% complete for SRR5513009_1.fastq.gz
Approx 45% complete for SRR5513009_1.fastq.gz
Approx 50% complete for SRR5513009_1.fastq.gz
Approx 55% complete for SRR5513009_1.fastq.gz
Approx 60% complete for SRR5513009_1.fastq.gz
Approx 65% complete for SRR5513009_1.fastq.gz
Approx 70% complete for SRR5513009_1.fastq.gz
Approx 75% complete for SRR5513009_1.fastq.gz
Approx 80% complete for SRR5513009_1.fastq.gz
Approx 85% complete for SRR5513009_1.fastq.gz
Approx 90% complete for SRR5513009_1.fastq.gz
Approx 95% complete for SRR5513009_1.fastq.gz
Analysis complete for SRR5513009_1.fastq.gz
wwwww77@wwwww77-VirtualBox:~/assembly$ fastqc SRR5513009_2.fastq.gz
Started analysis of SRR5513009_2.fastq.gz
Approx 5% complete for SRR5513009_2.fastq.gz
Approx 10% complete for SRR5513009_2.fastq.gz
Approx 15% complete for SRR5513009_2.fastq.gz
Approx 20% complete for SRR5513009_2.fastq.gz
Approx 25% complete for SRR5513009_2.fastq.gz
Approx 30% complete for SRR5513009_2.fastq.gz
Approx 35% complete for SRR5513009_2.fastq.gz
Approx 40% complete for SRR5513009_2.fastq.gz
Approx 45% complete for SRR5513009_2.fastq.gz
Approx 50% complete for SRR5513009_2.fastq.gz
Approx 55% complete for SRR5513009_2.fastq.gz
Approx 60% complete for SRR5513009_2.fastq.gz
Approx 65% complete for SRR5513009_2.fastq.gz
Approx 70% complete for SRR5513009_2.fastq.gz
Approx 75% complete for SRR5513009_2.fastq.gz
Approx 80% complete for SRR5513009_2.fastq.gz
Approx 85% complete for SRR5513009_2.fastq.gz
Approx 90% complete for SRR5513009_2.fastq.gz
Approx 95% complete for SRR5513009_2.fastq.gz
Analysis complete for SRR5513009_2.fastq.gz
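- We can download the HTML report to the Windows machine and look at the results there.
- The report shows that the input file contains 5843752 reads with lengths of 35-151 bp. The GC content is 67%, which is high, but it agrees with the GC content of the final assembly (67.65% in the QUAST report in step 8), so it most likely reflects the genome itself rather than a sequencing artifact.
- In the Per base sequence quality plot, most of our reads fall in the green zone, which indicates high quality.
- The overall base quality scores are also in the high-quality range.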

(Figures: FastQC report pages for SRR5513009_1.fastq.gz and SRR5513009_2.fastq.gz)
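The same headline numbers can be read without opening the HTML report, because each FastQC zip archive contains a tab-separated fastqc_data.txt. A minimal sketch, assuming the usual layout in which Basic Statistics is the first module; `fastqc_basic_stats` is a helper name invented here.

```shell
# Read fastqc_data.txt on stdin and print read count, read length and %GC.
fastqc_basic_stats() {
    awk -F '\t' '
        $1 == "Total Sequences" || $1 == "Sequence length" || $1 == "%GC" {
            print $1 ": " $2
        }
        /^>>END_MODULE/ { exit }   # stop after the first module (Basic Statistics)
    '
}

# Usage (archive and folder names follow the FastQC naming for this input file):
# unzip -p SRR5513009_1_fastqc.zip SRR5513009_1_fastqc/fastqc_data.txt | fastqc_basic_stats
```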
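5. Adapter removal with Trimmomatic:
- According to the article, the data were sequenced on an Illumina platform. We use Trimmomatic to remove the adapters, since this tool was designed specifically for Illumina data.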
mkdir trim_out
java -jar ~/Biosofts/Trimmomatic038/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 SRR5513009_1.fastq.gz SRR5513009_2.fastq.gz ./trim_out/output_forward_paired.fq.gz ./trim_out/output_forward_unpaired.fq.gz ./trim_out/output_reverse_paired.fq.gz ./trim_out/output_reverse_unpaired.fq.gz ILLUMINACLIP:/home/wwwww77/Biosofts/Trimmomatic038/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 LEADING:5 TRAILING:5 MINLEN:50

(Figures: Trimmomatic run output and the contents of trim_out)
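The one-line command above is hard to read, so here is the same call laid out with one argument group per line. This is only a sketch: `trim_pe` and the `RUN` dry-run switch are conventions invented for this note, and the jar and adapter paths are passed in as arguments.

```shell
# Trimming steps, applied in the order given on the command line:
#   ILLUMINACLIP:<adapters>:2:30:10  remove adapters (seed mismatches 2,
#                                    palindrome threshold 30, simple threshold 10)
#   SLIDINGWINDOW:4:15               cut when mean quality in a 4-base window drops below 15
#   LEADING:5 / TRAILING:5           trim bases below quality 5 from the start / end
#   MINLEN:50                        drop reads shorter than 50 bases after trimming
# RUN defaults to "echo", so the command is only printed; call with RUN= to execute it.
trim_pe() {
    jar=$1        # path to trimmomatic-0.38.jar
    adapters=$2   # path to TruSeq3-PE.fa
    ${RUN-echo} java -jar "$jar" PE -phred33 \
        SRR5513009_1.fastq.gz SRR5513009_2.fastq.gz \
        trim_out/output_forward_paired.fq.gz trim_out/output_forward_unpaired.fq.gz \
        trim_out/output_reverse_paired.fq.gz trim_out/output_reverse_unpaired.fq.gz \
        "ILLUMINACLIP:$adapters:2:30:10" \
        SLIDINGWINDOW:4:15 \
        LEADING:5 \
        TRAILING:5 \
        MINLEN:50
}

# Dry run:  trim_pe /path/to/trimmomatic-0.38.jar /path/to/TruSeq3-PE.fa
# Real run: RUN= trim_pe /path/to/trimmomatic-0.38.jar /path/to/TruSeq3-PE.fa
```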
6. Run FastQC again to assess the quality of the filtered data
wwwww77@wwwww77-VirtualBox:~/assembly$ fastqc trim_out/output_forward_paired.fq.gz
Started analysis of output_forward_paired.fq.gz
Approx 5% complete for output_forward_paired.fq.gz
Approx 10% complete for output_forward_paired.fq.gz
Approx 15% complete for output_forward_paired.fq.gz
Approx 20% complete for output_forward_paired.fq.gz
Approx 25% complete for output_forward_paired.fq.gz
Approx 30% complete for output_forward_paired.fq.gz
Approx 35% complete for output_forward_paired.fq.gz
Approx 40% complete for output_forward_paired.fq.gz
Approx 45% complete for output_forward_paired.fq.gz
Approx 50% complete for output_forward_paired.fq.gz
Approx 55% complete for output_forward_paired.fq.gz
Approx 60% complete for output_forward_paired.fq.gz
Approx 65% complete for output_forward_paired.fq.gz
Approx 70% complete for output_forward_paired.fq.gz
Approx 75% complete for output_forward_paired.fq.gz
Approx 80% complete for output_forward_paired.fq.gz
Approx 85% complete for output_forward_paired.fq.gz
Approx 90% complete for output_forward_paired.fq.gz
Approx 95% complete for output_forward_paired.fq.gz
Analysis complete for output_forward_paired.fq.gz
wwwww77@wwwww77-VirtualBox:~/assembly$ fastqc trim_out/output_reverse_paired.fq.gz
Started analysis of output_reverse_paired.fq.gz
Approx 5% complete for output_reverse_paired.fq.gz
Approx 10% complete for output_reverse_paired.fq.gz
Approx 15% complete for output_reverse_paired.fq.gz
Approx 20% complete for output_reverse_paired.fq.gz
Approx 25% complete for output_reverse_paired.fq.gz
Approx 30% complete for output_reverse_paired.fq.gz
Approx 35% complete for output_reverse_paired.fq.gz
Approx 40% complete for output_reverse_paired.fq.gz
Approx 45% complete for output_reverse_paired.fq.gz
Approx 50% complete for output_reverse_paired.fq.gz
Approx 55% complete for output_reverse_paired.fq.gz
Approx 60% complete for output_reverse_paired.fq.gz
Approx 65% complete for output_reverse_paired.fq.gz
Approx 70% complete for output_reverse_paired.fq.gz
Approx 75% complete for output_reverse_paired.fq.gz
Approx 80% complete for output_reverse_paired.fq.gz
Approx 85% complete for output_reverse_paired.fq.gz
Approx 90% complete for output_reverse_paired.fq.gz
Approx 95% complete for output_reverse_paired.fq.gz
Analysis complete for output_reverse_paired.fq.gz
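- To make it easier to compare quality before and after filtering, we use MultiQC to combine the results into a single interactive HTML report.
Unexpectedly, the quality reports for the filtered forward and reverse reads came out the same, and MultiQC treated them as a single report file.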
multiqc *.zip
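Download multiqc_report.html to the local machine with WinSCP to view it.
The filtering turned out not to make much difference. The read duplication rate dropped slightly, and before filtering SRR5513009_2 had a slightly higher proportion of N bases at each read position, although it was still in the high-quality range.
Figure 3 also shows that the original reads had essentially no adapter contamination.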
(Figures 1-3: MultiQC report screenshots)
7. Assemble the draft genome with SPAdes:
- The article states the parameters: the input is paired-end reads, and --careful is used to reduce mismatches and short indels.
Genome assemblies were produced with SPAdes genome assembler version 3.10 (14), set in “paired-end assembly, careful mode,”
wwwww77@wwwww77-VirtualBox:~/assembly/trim_out$ spades.py --careful --pe1-1 output_forward_paired.fq.gz --pe1-2 output_reverse_paired.fq.gz -o ./SPAdes_out
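- The run failed with an error. From what I found online, SPAdes err code 255 is caused by insufficient RAM. We shut down the virtual machine, increased its memory (I set it to 5058 MB), restarted, and ran the command again.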

(Figures: the SPAdes error message and the adjusted memory setting)
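Before rerunning SPAdes it helps to check how much memory the virtual machine really has. A minimal sketch for Linux, reading the `MemTotal` line of /proc/meminfo; `mem_total_gb` is a helper name invented for this note.

```shell
# Read /proc/meminfo-style text on stdin and print total RAM in whole GB (rounded down).
mem_total_gb() {
    awk '$1 == "MemTotal:" { printf "%d\n", $2 / 1024 / 1024 }'   # value is given in kB
}

# Usage:
# mem_total_gb < /proc/meminfo
# The result can be passed to SPAdes as its memory limit, for example:
# spades.py --careful -m "$(mem_total_gb < /proc/meminfo)" --pe1-1 ... --pe1-2 ... -o ./SPAdes_out
```

Note that `-m` only sets the limit at which SPAdes stops; it does not make the assembly use less memory, so the virtual machine still needs enough RAM for the data set.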
- The SPAdes assembly completed.
8. Evaluate the assembly with QUAST
- Parameters can be customized; here we skip contigs shorter than 200 bp.
wwwww77@wwwww77-VirtualBox:~/assembly/trim_out$ quast.py SPAdes_out/contigs.fasta --min-contig 200 -o SPAdes_out/quast_out

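- QUAST produces many output files. We can read report.txt directly.
We can also open the HTML reports. icarus.html is usually the one to look at: it is a navigation page that makes it easier to browse the other results.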

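- In general, the fewer contigs/scaffolds, the more plausible the total length, and the higher the N50 and similar values, the better the assembly.
report.txt shows that the Arthrobacter sp. 4041 assembly has a total length of 3912868 bp, a GC content of 67.65%, and an N50 of 536987 bp.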
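To make the N50 value less of a black box, it can be recomputed from contigs.fasta. N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The sketch below is not how QUAST is implemented; `assembly_stats` is a helper name invented here, and its default cutoff of 200 bp mirrors the --min-contig value used above.

```shell
# Print total length and N50 of a FASTA file, ignoring contigs shorter than min_len.
assembly_stats() {
    fasta=$1
    min_len=${2:-200}
    # First awk: emit one length per contig. sort: longest first. Second awk: N50.
    awk -v min="$min_len" '
        /^>/ { if (len > 0 && len >= min) print len; len = 0; next }
        { len += length($0) }
        END { if (len > 0 && len >= min) print len }
    ' "$fasta" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run * 2 >= total) { print "total=" total " N50=" lens[i]; exit }
            }
        }
    '
}

# Usage:
# assembly_stats SPAdes_out/contigs.fasta 200
```

If the numbers differ from report.txt, the first thing to check is whether both were computed with the same minimum contig length.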








