samtools—手冊

官方手冊:http://www.htslib.org/doc/samtools.html

Samtools 是一組實用程序,用于處理 SAM(序列對齊/映射)、BAM 和 CRAM 格式的對齊。它在格式之間進(jìn)行轉(zhuǎn)換,進(jìn)行排序、合并和索引,并且可以快速檢索任何區(qū)域的讀取。

1.下載:

(1)直接conda安裝

conda install -c bioconda samtools
#然后自己加入環(huán)境變量
#輸入samtools檢查一下是否安裝成功

2.使用

samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.11 (using htslib 1.11)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     fqidx          index/extract FASTQ
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates
     ampliconclip   clip oligos from the end of reads

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     coverage       alignment depth and percent coverage
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)
     ampliconstats  generate amplicon specific stats

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

3.常見操作

view

samtools view [options] in.sam|in.bam|in.cram [region...]
Output options:
  -b, --bam                  Output BAM    輸出BAM格式
  -C, --cram                 Output CRAM (requires -T)    輸出CRAM文件
  -1, --fast                 Use fast BAM compression (and default to --bam)    快速壓縮(需要-b)
  -u, --uncompressed         Uncompressed BAM output (and default to --bam) 輸出未壓縮的bam文件,節(jié)約時間,占據(jù)較多磁盤空間(需要-b)
  -h, --with-header          Include header in SAM output  默認(rèn)輸出sam文件不帶表頭,該參數(shù)設(shè)定后輸出帶表頭信息
  -H, --header-only          Print SAM header only (no alignments)  僅僅輸出表頭信息
      --no-header            Print SAM alignment records only [default]  
  -c, --count                Print only the count of matching records   僅打印匹配數(shù)
  -o, --output FILE          Write output to FILE [standard output]  輸出文件(stdout標(biāo)準(zhǔn)輸出)
  -U, --unoutput FILE, --output-unselected FILE
                             Output reads not selected by filters to FILE 輸出沒有經(jīng)過過濾選擇的reads
  -p, --unmap                Set flag to UNMAP on reads not selected
                             then write to output file.
  -P, --fetch-pairs          Retrieve complete pairs even when outside of region
Input options:
  -t, --fai-reference FILE   FILE listing reference names and lengths 制表分隔符文件(需要提供額外的參考數(shù)據(jù),比如參考基因組的索引)
  -M, --use-index            Use index and multi-region iterator for regions 
      --region[s]-file FILE  Use index to include only reads overlapping FILE
  -X, --customized-index     Expect extra index file argument after <in.bam>

Filtering options (Only include in output reads that...):
  -L, --target[s]-file FILE  ...overlap (BED) regions in FILE 僅包括和bed文件有重疊的reads
  -r, --read-group STR       ...are in read group STR 僅輸出在STR讀段組中的reads
  -R, --read-group-file FILE ...are in a read group listed in FILE 僅輸出特定reads
  -N, --qname-file FILE      ...whose read name is listed in FILE
  -d, --tag STR1[:STR2]      ...have a tag STR1 (with associated value STR2)
  -D, --tag-file STR:FILE    ...have a tag STR whose value is listed in FILE
  -q, --min-MQ INT           ...have mapping quality >= INT 定位的質(zhì)量大于INT[默認(rèn)0]
  -l, --library STR          ...are in library STR 僅輸出在STR 庫中的reads
  -m, --min-qlen INT         ...cover >= INT query bases (as measured via CIGAR)
  -e, --expr STR             ...match the filter expression STR
  -f, --require-flags FLAG   ...have all of the FLAGs present  獲得未必對上(unmapped)的過濾設(shè)置[默認(rèn)0]
  -F, --excl[ude]-flags FLAG ...have none of the FLAGs present
      --rf, --incl-flags, --include-flags FLAG   獲得比對上(mapped)的過濾設(shè)置[默認(rèn)0]
                             ...have some of the FLAGs present
  -G FLAG                    EXCLUDE reads with all of the FLAGs present
      --subsample FLOAT      Keep only FLOAT fraction of templates/read pairs
      --subsample-seed INT   Influence WHICH reads are kept in subsampling [0]
  -s INT.FRAC                Same as --subsample 0.FRAC --subsample-seed INT

Processing options:
      --add-flags FLAG       Add FLAGs to reads
      --remove-flags FLAG    Remove FLAGs from reads
  -x, --remove-tag STR
               Comma-separated read tags to strip (repeatable) [null]
      --keep-tag STR
               Comma-separated read tags to preserve (repeatable) [null].
               Equivalent to "-x ^STR"
  -B, --remove-B             Collapse the backward CIGAR operation

General options:
  -?, --help   Print long help, including note about region specification
  -S           Ignored (input format is auto-detected)
      --no-PG  Do not add a PG line
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
  -T, --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --write-index
               Automatically index the output files [off]
      --verbosity INT
               Set level of verbosity

  1. 要對來自BWA或其他比對工具產(chǎn)生的SAM輸出文件進(jìn)行進(jìn)一步的操作,我們需要首先將SAM轉(zhuǎn)換為其對應(yīng)的二進(jìn)制格式BAM。二進(jìn)制格式對于計算機來說更容易處理,而且減小了內(nèi)存。要將SAM轉(zhuǎn)換為BAM,使用samtools view命令,使用-S參數(shù)指定我們的輸入是SAM格式(默認(rèn)BAM),使用-b參數(shù)指定輸出格式是BAM(默認(rèn)BAM)。Samtools會將返回結(jié)果發(fā)送到標(biāo)準(zhǔn)輸出(UNIX STDOUT),因此需要使用重定向運算符 (“>”) 將其重定向到BAM文件,或者添加-o參數(shù),指定輸出文件名。
  2. view命令中,對sam文件頭部(序列ID)的輸入(-t或-T)和輸出(-h)是單獨的一些參數(shù)來控制的。
# 將sam文件轉(zhuǎn)換成bam文件
samtools view -S -b sample.sam > sample.bam
samtools view -S -b sample.sam -o sample.bam
samtools view -bS abc.sam > abc.bam

# BAM轉(zhuǎn)換為SAM
samtools view -h -o out.sam out.bam
samtools view -h  out.sam  > out.bam

# 提取比對到參考序列上的比對結(jié)果
samtools view -bF 4 abc.bam > abc.F.bam

# 提取paired reads中兩條reads都比對到參考序列上的比對結(jié)果,只需要把兩個4+8的值12作為過濾參數(shù)即可
samtools view -bF 12 abc.bam > abc.F12.bam

#篩選出比對質(zhì)量值大于30的情況
samtools view -q 30 abc.bam|awk '{print $1,$5}' | grep '[IDNSHPX]' |head -4

#篩選出比對成功,但是并不是完全匹配的序列
samtools view -q 30 abc.bam|awk '{print $6}' | grep '[IDNSHPX]' |head -4

# 提取沒有比對到參考序列上的比對結(jié)果
samtools view -bf 4 abc.bam > abc.f.bam

#比對到反向互補鏈的reads
samtools view -f 16 abc.bam |head -1

#比對到正向鏈的reads
samtools view -F 16 abc.bam |head -1

#統(tǒng)計共有多少條reads(pair-end-reads這里算一條)參與了比對參考基因組
echo `samtools view -c abc.bam`/2 |bc

# 提取bam文件中比對到caffold1上的比對結(jié)果,并保存到sam文件格式
samtools view abc.bam scaffold1 > scaffold1.sam

# 提取scaffold1上能比對到30k到100k區(qū)域的比對結(jié)果
samtools view abc.bam scaffold1:30000-100000 $gt; scaffold1_30k-100k.sam

# 根據(jù)fasta文件,將 header 加入到 sam 或 bam 文件中
samtools view -T genome.fasta -h scaffold1.sam > scaffold1.h.sam

#除格式轉(zhuǎn)換外,顧名思義,view用于查看文件內(nèi)容
samtools view sample.bam | head
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容