劉小澤記錄于 2019.5.10
將Turner寫的一些好用的單行命令與大家分享,原文還有許多可以去看
https://github.com/stephenturner/oneliners
About Fastq/fasta
fastq sequences length distribution => 得到fq文件中序列長度的分布
$ zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'
reverse complement => 反向互補(bǔ)
$ echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'
fastq2fasta
$ zcat file.fastq.gz | paste - - - - | perl -ane 'print ">$F[0]\n$F[1]\n";' | gzip -c > file.fasta.gz
split a multifasta file into single ones with csplit => fasta按>拆分
# * refers to the number of files 可以選擇拆分的文件數(shù)量
$ csplit -z -q -n 4 -f test test.fa /\>/ {*}
# OR use awk 一次性全部按>拆分
$ awk '/^>/{s=++d".fa"} {print > s}' multi.fa
single line fasta to multi-line of 50 characters in each line => 單行fa變多行
$ awk -v FS= '/^>/{print;next}{for (i=0;i<=NF/50;i++) {for (j=1;j<=50;j++) printf "%s", $(i*50 +j); print ""}}' file
# fold -w 50 file
multi-line fasta to one-line => 一個多行fa文件變單行
# 方法一:
$ awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa
# 方法二:
$ cat file.fasta | awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}'
Number of reads in a fastq file => 統(tǒng)計fq中序列數(shù)(4行一個序列)
$ cat file.fq | echo $((`wc -l`/4))
print length of each entry in a multifasta file => 打印fa中每個序列的長度
$ awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa
subsample fastq => 取fq文件的子集(其中0.01是指取出來百分之1的reads)
$ cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq
About Sam/Bam
bam2bed
samtools view file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=1;$tmp=$F[5];$tmp =~ s/(\d+)[MD]/$length+=$1/eg;print "$F[2]\t$F[3]\t".($F[3]+$length)."\t$F[0]\t0\t$strand\n";' > file.bed
bam2wig
samtools mpileup -BQ0 file.sorted.bam | perl -pe '($c, $start, undef, $depth) = split;if ($c ne $lastC || $start != $lastStart+1) {print "fixedStep chrom=$c start=$start step=1 span=1\n";}$_ = $depth."\n";($lastC, $lastStart) = ($c, $start);' | gzip -c > file.wig.gz
Basis Linux
get all folders' size in the current folder => 當(dāng)前目錄下的所有目錄大小
$ du -h --max-depth=1
exit a dead ssh session => 退出卡死的ssh界面
$ ~.
copy large folders fast => 快速拷貝大文件夾
# copy every file in folder 拷貝目錄下的所有文件
rsync -av from_dir/ to_dir
# skip transferred files 跳過已拷貝的文件
rsync -avhP from_dir/ to_dir
find bam in the current folder recursively and copy them to a new dir with 5 CPUs => 拷貝大文件(如bam)到其他文件夾,并用5個線程
find . -name "*bam" | xargs -P5 -I{} rsync -av {} dest_dir
group files by extensions => 按照后綴的順序排序文件
ll -X
loop through all the names => 循環(huán)語句
for i in {1..22} X Y
do
echo $i
done
# 對于{01..22} 的結(jié)果是 01 02 ...
GREP
grep fastq reads containing a pattern but maintain the fastq format => 匹配fq中序列并打印
# 例如要在SP1.fq中找到這段序列的fq格式
# 如果匹配到多個,那么每條序列中間會用--分隔,因此需要用sed去除
$ grep -A 2 -B 1 'TGAGACAACATCT' SP1.fq | sed '/^--$/d' > out.fq
SED
delete with sed => 刪除行
# delete blank lines
sed /^$/d
# delete the last line
sed $d
AWK
awk join two files with common columns => awk連接有共同列的文件(類似于R的merge函數(shù))
# http://stackoverflow.com/questions/13258604/join-two-files-using-awk
# file_a.bed:
chr1 123 aa b c d
chr1 234 a b c d
chr1 345 aa b c d
chr1 456 a b c d
# file_b.bed
xxxx abcd chr1 123 aa c d e
yyyy defg chr1 345 aa e f g
# 現(xiàn)在想在a的基礎(chǔ)上根據(jù)a、b共有列來增加b中的新內(nèi)容
$ awk 'NR==FNR{a[$3,$4,$5]=$1OFS$2;next}{$6=a[$1,$2,$3];print}' OFS='\t' \
file_b.bed file_a.bed
# 結(jié)果
chr1 123 aa b c xxxx abcd
chr1 234 a b c
chr1 345 aa b c yyyy defg
chr1 456 a b c
Explanation:
- NR==FNR NR is the current input line number and FNR the current file's line number. The two will be equal only while the 1st file is being read.
- OFS awk set the output field seperator; while set the input seperator is -F
- next means to proceed for the next line, rather than execute the following { } code block
awk to compare two different files and print if matches=> 比較兩個文件的指定列,然后打印比對上的行
# https://unix.stackexchange.com/questions/134829/compare-two-columns-of-different-files-and-print-if-it-matches
# 例如 file1
abc|123|BNY|apple|
cab|234|cyx|orange|
def|kumar|pki|bird|
# file2
abc|123|
kumar|pki|
cab|234
# expected
abc|123|BNY|apple|
cab|234|cyx|orange|
$ awk -F'|' 'NR==FNR{a[$1$2]++;next};a[$1$2] > 0' file2 file1
conditional operator => 條件判斷
基本格式:var=condition?condition_if_true:condition_if_false
例如:
# 現(xiàn)在有這個文件 test
a1 ACTGTCTGTCACTGTGTTGTGATGTTG
a2 ACTTTATATAT
a3 ACTTATATATATATA
a4 ACTTATATATATATA
a5 ACTTTATATATT
# 我想看看每行序列部分是不是大于14個堿基
$ awk '{print (length($2)>14)?$0">14":$0"<=14"}' test
get new line => 在原來的內(nèi)容基礎(chǔ)上增加新內(nèi)容
$ awk 'BEGIN{while((getline k <"test")>0) print "NEW:"k}{print}' test
merge multi-fasta into one single fasta => 合并多個fasta文件到一個文件中
# give a awk script called linearize.awk
$cat >linearize.awk
# then copy and paste below
/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}
# run the awk script
$ paste <(awk -f linearize.awk file1.fa ) <(awk -f linearize.awk file2.fa )| tr "\t" "\n" > multi.fa
根據(jù)id輸出序列
while read -r line; do awk -v pattern=$line -v RS=">" '$0 ~ pattern { printf(">%s", $0); }' Seq.fasta; done < id.txt > output.fa
需要指定Seq.fasta、id.txt