官方手冊:http://www.htslib.org/doc/samtools.html
Samtools 是一組實用程序,用于處理 SAM(序列對齊/映射)、BAM 和 CRAM 格式的對齊。它在格式之間進(jìn)行轉(zhuǎn)換,進(jìn)行排序、合并和索引,并且可以快速檢索任何區(qū)域的讀取。
1.下載:
(1)直接conda安裝
conda install -c bioconda samtools
#然后自己加入環(huán)境變量
#輸入samtools檢查一下是否安裝成功
2.使用
samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.11 (using htslib 1.11)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
fqidx index/extract FASTQ
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
markdup mark duplicates
ampliconclip clip oligos from the end of reads
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
-- Statistics
bedcov read depth per BED region
coverage alignment depth and percent coverage
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
ampliconstats generate amplicon specific stats
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
3.常見操作
view
samtools view [options] in.sam|in.bam|in.cram [region...]
Output options:
-b, --bam Output BAM 輸出BAM格式
-C, --cram Output CRAM (requires -T) 輸出CRAM文件
-1, --fast Use fast BAM compression (and default to --bam) 快速壓縮(需要-b)
-u, --uncompressed Uncompressed BAM output (and default to --bam) 輸出未壓縮的bam文件,節(jié)約時間,占據(jù)較多磁盤空間(需要-b)
-h, --with-header Include header in SAM output 默認(rèn)輸出sam文件不帶表頭,該參數(shù)設(shè)定后輸出帶表頭信息
-H, --header-only Print SAM header only (no alignments) 僅僅輸出表頭信息
--no-header Print SAM alignment records only [default]
-c, --count Print only the count of matching records 僅打印匹配數(shù)
-o, --output FILE Write output to FILE [standard output] 輸出文件(stdout標(biāo)準(zhǔn)輸出)
-U, --unoutput FILE, --output-unselected FILE
Output reads not selected by filters to FILE 輸出沒有經(jīng)過過濾選擇的reads
-p, --unmap Set flag to UNMAP on reads not selected
then write to output file.
-P, --fetch-pairs Retrieve complete pairs even when outside of region
Input options:
-t, --fai-reference FILE FILE listing reference names and lengths 制表分隔符文件(需要提供額外的參考數(shù)據(jù),比如參考基因組的索引)
-M, --use-index Use index and multi-region iterator for regions
--region[s]-file FILE Use index to include only reads overlapping FILE
-X, --customized-index Expect extra index file argument after <in.bam>
Filtering options (Only include in output reads that...):
-L, --target[s]-file FILE ...overlap (BED) regions in FILE 僅包括和bed文件有重疊的reads
-r, --read-group STR ...are in read group STR 僅輸出在STR讀段組中的reads
-R, --read-group-file FILE ...are in a read group listed in FILE 僅輸出特定reads
-N, --qname-file FILE ...whose read name is listed in FILE
-d, --tag STR1[:STR2] ...have a tag STR1 (with associated value STR2)
-D, --tag-file STR:FILE ...have a tag STR whose value is listed in FILE
-q, --min-MQ INT ...have mapping quality >= INT 定位的質(zhì)量大于INT[默認(rèn)0]
-l, --library STR ...are in library STR 僅輸出在STR 庫中的reads
-m, --min-qlen INT ...cover >= INT query bases (as measured via CIGAR)
-e, --expr STR ...match the filter expression STR
-f, --require-flags FLAG ...have all of the FLAGs present 獲得未必對上(unmapped)的過濾設(shè)置[默認(rèn)0]
-F, --excl[ude]-flags FLAG ...have none of the FLAGs present
--rf, --incl-flags, --include-flags FLAG 獲得比對上(mapped)的過濾設(shè)置[默認(rèn)0]
...have some of the FLAGs present
-G FLAG EXCLUDE reads with all of the FLAGs present
--subsample FLOAT Keep only FLOAT fraction of templates/read pairs
--subsample-seed INT Influence WHICH reads are kept in subsampling [0]
-s INT.FRAC Same as --subsample 0.FRAC --subsample-seed INT
Processing options:
--add-flags FLAG Add FLAGs to reads
--remove-flags FLAG Remove FLAGs from reads
-x, --remove-tag STR
Comma-separated read tags to strip (repeatable) [null]
--keep-tag STR
Comma-separated read tags to preserve (repeatable) [null].
Equivalent to "-x ^STR"
-B, --remove-B Collapse the backward CIGAR operation
General options:
-?, --help Print long help, including note about region specification
-S Ignored (input format is auto-detected)
--no-PG Do not add a PG line
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
-T, --reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
--write-index
Automatically index the output files [off]
--verbosity INT
Set level of verbosity
- 要對來自BWA或其他比對工具產(chǎn)生的SAM輸出文件進(jìn)行進(jìn)一步的操作,我們需要首先將SAM轉(zhuǎn)換為其對應(yīng)的二進(jìn)制格式BAM。二進(jìn)制格式對于計算機來說更容易處理,而且減小了內(nèi)存。要將SAM轉(zhuǎn)換為BAM,使用samtools view命令,使用-S參數(shù)指定我們的輸入是SAM格式(默認(rèn)BAM),使用-b參數(shù)指定輸出格式是BAM(默認(rèn)BAM)。Samtools會將返回結(jié)果發(fā)送到標(biāo)準(zhǔn)輸出(UNIX STDOUT),因此需要使用重定向運算符 (“>”) 將其重定向到BAM文件,或者添加-o參數(shù),指定輸出文件名。
- view命令中,對sam文件頭部(序列ID)的輸入(-t或-T)和輸出(-h)是單獨的一些參數(shù)來控制的。
# 將sam文件轉(zhuǎn)換成bam文件
samtools view -S -b sample.sam > sample.bam
samtools view -S -b sample.sam -o sample.bam
samtools view -bS abc.sam > abc.bam
# BAM轉(zhuǎn)換為SAM
samtools view -h -o out.sam out.bam
samtools view -h out.sam > out.bam
# 提取比對到參考序列上的比對結(jié)果
samtools view -bF 4 abc.bam > abc.F.bam
# 提取paired reads中兩條reads都比對到參考序列上的比對結(jié)果,只需要把兩個4+8的值12作為過濾參數(shù)即可
samtools view -bF 12 abc.bam > abc.F12.bam
#篩選出比對質(zhì)量值大于30的情況
samtools view -q 30 abc.bam|awk '{print $1,$5}' | grep '[IDNSHPX]' |head -4
#篩選出比對成功,但是并不是完全匹配的序列
samtools view -q 30 abc.bam|awk '{print $6}' | grep '[IDNSHPX]' |head -4
# 提取沒有比對到參考序列上的比對結(jié)果
samtools view -bf 4 abc.bam > abc.f.bam
#比對到反向互補鏈的reads
samtools view -f 16 abc.bam |head -1
#比對到正向鏈的reads
samtools view -F 16 abc.bam |head -1
#統(tǒng)計共有多少條reads(pair-end-reads這里算一條)參與了比對參考基因組
echo `samtools view -c abc.bam`/2 |bc
# 提取bam文件中比對到caffold1上的比對結(jié)果,并保存到sam文件格式
samtools view abc.bam scaffold1 > scaffold1.sam
# 提取scaffold1上能比對到30k到100k區(qū)域的比對結(jié)果
samtools view abc.bam scaffold1:30000-100000 $gt; scaffold1_30k-100k.sam
# 根據(jù)fasta文件,將 header 加入到 sam 或 bam 文件中
samtools view -T genome.fasta -h scaffold1.sam > scaffold1.h.sam
#除格式轉(zhuǎn)換外,顧名思義,view用于查看文件內(nèi)容
samtools view sample.bam | head