Sharing handy bioinformatics one-liners

Recorded by 劉小澤 on 2019-05-10
A selection of useful one-liners written by Stephen Turner; the original collection has many more:
https://github.com/stephenturner/oneliners

About Fastq/fasta

fastq sequence length distribution
$ zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'  
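The same awk works on an uncompressed file; a quick check on a tiny made-up FASTQ (`toy.fastq` is invented here):

```shell
# two 3 bp reads and one 5 bp read
printf '@r1\nACT\n+\nIII\n@r2\nGGA\n+\nIII\n@r3\nACGTA\n+\nIIIII\n' > toy.fastq
# NR%4==2 picks the sequence line of every 4-line record
awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) print l, lengths[l]}' toy.fastq
```

This should report two reads of length 3 and one of length 5; the order of the `for (l in lengths)` loop is unspecified, so pipe through `sort -n` for a stable listing.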
reverse complement
$ echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'
fastq2fasta
$ zcat file.fastq.gz | paste - - - - | perl -ane 'print ">$F[0]\n$F[1]\n";' | gzip -c > file.fasta.gz
split a multi-fasta file into individual files with csplit
# {*} repeats the split pattern until the input runs out; -n 4 pads the numeric suffix to 4 digits, -z skips empty output files
$ csplit -z -q -n 4 -f test test.fa /\>/ {*}
# OR use awk to split on every > in one pass
$ awk '/^>/{s=++d".fa"} {print > s}' multi.fa
single-line fasta to multi-line with 50 characters per line
$ awk -v FS= '/^>/{print;next}{for (i=0;i<=NF/50;i++) {for (j=1;j<=50;j++) printf "%s", $(i*50 +j); print ""}}' file

# simpler: fold -w 50 file (but note fold wraps the header lines too)
multi-line fasta to one line per record
# method 1:
$ awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa
# method 2:
$ cat file.fasta | awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}'
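Method 1 in miniature: each record collapses to one line, with the header and sequence separated by a tab:

```shell
printf '>s1\nAC\nGT\n>s2\nTT\n' \
  | awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}'
# prints ">s1<TAB>ACGT" then ">s2<TAB>TT"
```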
number of reads in a fastq file (4 lines per record)
$ cat file.fq | echo $((`wc -l`/4))
print the length of each entry in a multi-fasta file
$ awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa
subsample a fastq file (0.01 keeps roughly 1% of the reads)
$ cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq
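Because `srand(1234)` fixes the seed, a given awk picks the same record indices on every run. Applied to both mates of a paired-end sample (placeholder file names below), R1 and R2 stay in sync:

```shell
# same seed on both files => same rand() stream => matching record subsets
for mate in R1 R2; do
  cat sample_${mate}.fq | paste - - - - \
    | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' \
    | tr '\t' '\n' > sub_${mate}.fq
done
```

The exact `rand()` stream differs between awk implementations (gawk vs mawk), but it is identical across two runs of the same one, which is all the pairing needs.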

About Sam/Bam

bam2bed
samtools view file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=1;$tmp=$F[5];$tmp =~ s/(\d+)[MD]/$length+=$1/eg;print "$F[2]\t$F[3]\t".($F[3]+$length)."\t$F[0]\t0\t$strand\n";' > file.bed
bam2wig
samtools mpileup -BQ0 file.sorted.bam | perl -pe '($c, $start, undef, $depth) = split;if ($c ne $lastC || $start != $lastStart+1) {print "fixedStep chrom=$c start=$start step=1 span=1\n";}$_ = $depth."\n";($lastC, $lastStart) = ($c, $start);' | gzip -c > file.wig.gz

Basic Linux

get the size of each folder in the current directory
$ du -h --max-depth=1
exit a dead ssh session
# press Enter first; the ~ escape is only recognized at the start of a line
$ ~.
copy large folders fast
# copy every file in the folder
rsync -av from_dir/ to_dir
# -h human-readable sizes, -P show progress and keep partial files; already-transferred files are skipped
rsync -avhP from_dir/ to_dir
find bam files recursively in the current folder and copy them to a new directory, running 5 rsync processes in parallel
find . -name "*bam" | xargs -P5 -I{} rsync -av {} dest_dir
group files by extension
# ll is the common alias for ls -l
ll -X
loop over a list of names
for i in {1..22} X Y 
do
  echo $i
done
# {01..22} expands with zero padding: 01 02 ... 22
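A quick look at the two expansions (brace expansion is a bash feature, not POSIX sh):

```shell
for i in {1..3} X; do echo "chr$i"; done    # chr1 chr2 chr3 chrX
for i in {01..03}; do echo "chr$i"; done    # chr01 chr02 chr03
```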

GREP

grep fastq reads containing a pattern but keep the fastq format
# e.g. pull the complete fastq records from SP1.fq whose sequence line matches;
# -B 1 keeps the header and -A 2 the + and quality lines; with multiple hits
# grep inserts -- separators between groups, which the sed removes
$ grep -A 2 -B 1 'TGAGACAACATCT' SP1.fq | sed '/^--$/d' > out.fq

SED

delete with sed
# delete blank lines (quote the script so the shell leaves it alone)
sed '/^$/d' file
# delete the last line (unquoted, $d would be expanded by the shell)
sed '$d' file
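Both deletions on a four-line toy file:

```shell
printf 'a\n\nb\nc\n' > demo.txt   # a, blank line, b, c
sed '/^$/d' demo.txt              # prints a, b, c
sed '$d' demo.txt                 # prints a, blank line, b
```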

AWK

awk: join two files on shared columns (like R's merge function)
# http://stackoverflow.com/questions/13258604/join-two-files-using-awk
# file_a.bed: 
chr1    123 aa  b   c   d
chr1    234 a   b   c   d
chr1    345 aa  b   c   d
chr1    456 a   b   c   d
# file_b.bed
xxxx    abcd    chr1    123 aa  c   d   e
yyyy    defg    chr1    345 aa  e   f   g
# goal: extend each line of a with the extra columns of b, matched on their shared columns
$ awk 'NR==FNR{a[$3,$4,$5]=$1OFS$2;next}{$6=a[$1,$2,$3];print}' OFS='\t' \
file_b.bed file_a.bed

# result:
chr1    123 aa  b   c   xxxx    abcd
chr1    234 a   b   c   
chr1    345 aa  b   c   yyyy    defg
chr1    456 a   b   c   

Explanation:

  • NR==FNR: NR is the overall input line number and FNR the current file's line number; the two are equal only while the first file is being read.
  • OFS sets awk's output field separator; the input separator is set with -F (or the FS variable).
  • next skips straight to the next input line instead of running the following { } block.
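The pattern in miniature, with two throwaway files:

```shell
# pass 1 (NR==FNR): build the lookup table from lookup.txt
# pass 2: annotate keys.txt from the table
printf '1 one\n2 two\n' > lookup.txt
printf '2\n1\n' > keys.txt
awk 'NR==FNR{a[$1]=$2;next}{print $1, a[$1]}' lookup.txt keys.txt
# 2 two
# 1 one
```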
awk: compare columns of two files and print the matching lines
# https://unix.stackexchange.com/questions/134829/compare-two-columns-of-different-files-and-print-if-it-matches
# e.g. file1
abc|123|BNY|apple|
cab|234|cyx|orange|
def|kumar|pki|bird|
# file2
abc|123|
kumar|pki|
cab|234
# expected
abc|123|BNY|apple|
cab|234|cyx|orange|

$  awk -F'|' 'NR==FNR{a[$1$2]++;next};a[$1$2] > 0' file2 file1
conditional (ternary) operator

basic form: var = condition ? value_if_true : value_if_false

for example:

# given this file, test:
a1  ACTGTCTGTCACTGTGTTGTGATGTTG
a2  ACTTTATATAT
a3  ACTTATATATATATA
a4  ACTTATATATATATA
a5  ACTTTATATATT    
# check whether the sequence part of each line is longer than 14 bases
# (parenthesize the whole ternary: a bare print cond?a:b is ambiguous in some awks)
$ awk '{print ((length($2)>14)?$0">14":$0"<=14")}' test
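The same check, self-contained with inline data:

```shell
printf 'a1 ACTGTCTGTCACTGTGTTGTGATGTTG\na2 ACTTTATATAT\n' \
  | awk '{print ((length($2)>14)?$0">14":$0"<=14")}'
# a1 ACTGTCTGTCACTGTGTTGTGATGTTG>14
# a2 ACTTTATATAT<=14
```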
get new lines with getline => add new content on top of the original
$ awk 'BEGIN{while((getline k <"test")>0) print "NEW:"k}{print}' test
merge multiple fasta files into one
# create an awk script called linearize.awk
$ cat > linearize.awk
# paste the line below, then press Ctrl-D to save
/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}

# run the awk script
$ paste <(awk -f linearize.awk file1.fa ) <(awk -f linearize.awk file2.fa  )| tr "\t" "\n" > multi.fa
extract sequences by ID
while read -r line; do awk -v pattern="$line" -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' Seq.fasta; done < id.txt > output.fa

Seq.fasta (the sequences) and id.txt (one ID per line) need to be supplied.
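The loop above rescans Seq.fasta once per ID; a single-pass sketch that matches IDs exactly against the first word of each header (same Seq.fasta/id.txt inputs assumed):

```shell
# pass 1: load the wanted IDs; pass 2: toggle `keep` at each > header
awk 'NR==FNR{ids[$1]=1;next}
     /^>/{keep=(substr($1,2) in ids)}
     keep' id.txt Seq.fasta > output.fa
```

Unlike the regex match in the loop, this avoids partial-ID hits (e.g. seq1 matching seq10).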
