1.cut
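cut splits each line of a file into fields so that individual fields can be processed separately, much like working with columns in Excel.
-b selects by byte; -c selects by character; -f selects by field; -d sets the field delimiter (default: Tab).
-d and -f are used most often. The delimiter varies from file to file (space, comma, Tab, and so on), and -f specifies which column or columns to select.
Use cat to view the file contents; this file is comma-separated: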
cat a.txt
SRR776504,RNA-Seq,646,8308777,PRJNA192864,SAMN01978938,24124470
SRR776504,RNA-Seq,646,8308777,PRJNA192864,SAMN01978938,24124470
SRR776505,RNA-Seq,579,26432223,PRJNA192864,SAMN01978939,78862401
SRR776505,RNA-Seq,652,26432336,PRJNA192864,SAMN01978939,78862611
SRR776505,RNA-Seq,652,26432254,PRJNA192864,SAMN01978939,78862001
SRR776506,RNA-Seq,577,24314163,PRJNA192864,SAMN01978940,72169712
cut -d ',' -f 1 a.txt
Splitting on commas gives 7 fields; take the first one:
SRR776504
SRR776504
SRR776505
SRR776505
SRR776505
SRR776506
sed 's/,/\t/g' a.txt | cut -f 1-4
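The sed command replaces every comma with a Tab (GNU sed understands \t in the replacement), and cut then takes fields 1 to 4: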
SRR776504 RNA-Seq 646 8308777
SRR776504 RNA-Seq 646 8308777
SRR776505 RNA-Seq 579 26432223
SRR776505 RNA-Seq 652 26432336
SRR776505 RNA-Seq 652 26432254
SRR776506 RNA-Seq 577 24314163
| is the pipe operator, used to chain commands: it passes the output of the command before it to the command after it as input.
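The comma-to-Tab step is only needed if Tab-separated output is wanted. As a hedged sketch (one line of the sample data in a temporary file, variable names chosen for illustration), the same four fields can be taken either by giving cut the comma directly or by converting with tr, which handles '\t' on systems where sed does not:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' \
  'SRR776504,RNA-Seq,646,8308777,PRJNA192864,SAMN01978938,24124470' \
  > "$tmpdir/a.txt"

# Option 1: pass the comma to cut; the output stays comma-separated.
with_commas=$(cut -d ',' -f 1-4 "$tmpdir/a.txt")
echo "$with_commas"

# Option 2: convert commas to tabs with tr, then pipe into cut,
# which splits on Tab by default.
with_tabs=$(tr ',' '\t' < "$tmpdir/a.txt" | cut -f 1-4)
echo "$with_tabs"

rm -r "$tmpdir"
```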
2.sort
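sort orders lines, by default comparing them as text in the current locale's collation order. -k selects the field(s) to sort on; -n sorts by numeric value; -u drops duplicate lines, equivalent to sort | uniq when no -k is given; -r reverses the order (the default is ascending); -t sets the field separator.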
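A small sketch of these options, assuming a shortened four-column copy of the sample data in a temporary file (the file name runs.csv and the variable names are made up for this example). LC_ALL=C is set so the text comparison is byte-wise and reproducible:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' \
  'SRR776505,RNA-Seq,652,26432336' \
  'SRR776504,RNA-Seq,646,8308777' \
  'SRR776506,RNA-Seq,577,24314163' \
  'SRR776504,RNA-Seq,646,8308777' \
  > "$tmpdir/runs.csv"

# -t ',' sets the separator; -k3,3n sorts on field 3 only, numerically.
field3_asc=$(LC_ALL=C sort -t ',' -k3,3n "$tmpdir/runs.csv" | cut -d ',' -f 3 | paste -sd ' ' -)
echo "$field3_asc"

# Without -n, field 4 is compared as text, so 8308777 sorts after 26432336.
field4_text=$(LC_ALL=C sort -t ',' -k4,4 "$tmpdir/runs.csv" | cut -d ',' -f 4 | paste -sd ' ' -)
echo "$field4_text"

# With -n and -r, field 4 is sorted by numeric value, largest first.
field4_num_desc=$(LC_ALL=C sort -t ',' -k4,4nr "$tmpdir/runs.csv" | cut -d ',' -f 4 | paste -sd ' ' -)
echo "$field4_num_desc"

# -u keeps one copy of each identical line.
unique_ids=$(LC_ALL=C sort -u "$tmpdir/runs.csv" | cut -d ',' -f 1 | paste -sd ' ' -)
echo "$unique_ids"

rm -r "$tmpdir"
```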
Working with the four fields extracted above:
Sort by the first column, then by the third column numerically in descending order:
sed 's/,/\t/g' a.txt | cut -f 1-4 | sort -k1,1 -k3,3nr
SRR776504 RNA-Seq 646 8308777
SRR776504 RNA-Seq 646 8308777
SRR776505 RNA-Seq 652 26432254
SRR776505 RNA-Seq 652 26432336
SRR776505 RNA-Seq 579 26432223
SRR776506 RNA-Seq 577 24314163
The first two lines are duplicates:
sed 's/,/\t/g' a.txt | cut -f 1-4 | sort -k1,1 -k3,3nr | uniq -c
2 SRR776504 RNA-Seq 646 8308777
1 SRR776505 RNA-Seq 652 26432254
1 SRR776505 RNA-Seq 652 26432336
1 SRR776505 RNA-Seq 579 26432223
1 SRR776506 RNA-Seq 577 24314163
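# uniq collapses adjacent duplicate lines; -c prefixes each line with its number of occurrences (the count appears in the first column); -d (duplicated) prints only the lines that are repeated.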
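uniq compares each line only with the line directly before it, which is why the input is sorted first. A minimal sketch with made-up input (the variable names are illustrative):

```shell
# Unsorted input: the first "a" is not adjacent to the later ones, so it survives.
adjacent_only=$(printf '%s\n' a b a a | uniq | paste -sd ' ' -)
echo "$adjacent_only"   # a b a

# Sorting first brings identical lines together, so uniq removes them all.
sorted_first=$(printf '%s\n' a b a a | sort | uniq | paste -sd ' ' -)
echo "$sorted_first"    # a b

# sort -u gives the same result in one step.
sort_u=$(printf '%s\n' a b a a | sort -u | paste -sd ' ' -)
echo "$sort_u"          # a b
```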
Another example:
cat <<END | uniq -c
> a
> a
> a
> b
> c
> d
> d
> END
3 a
1 b
1 c
2 d
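After cat <<END you can type the input from the keyboard; each line is prefixed with the > prompt, and typing END ends the input, after which the duplicates are counted.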
cat <<END | uniq -d
> a
> a
> a
> b
> c
> d
> d
> END
a
d
# Only the duplicated lines, a and d, are printed
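Combining these pieces gives a common pattern: sort, count with uniq -c, then sort again by the count. The sketch below uses made-up letters in a temporary file; the sed step only strips the leading padding that uniq -c adds, and -u (print only lines that are not repeated) is shown as the counterpart of -d:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' d a b a d a c > "$tmpdir/letters.txt"

# Frequency table: most frequent first, ties broken by the letter itself.
counts=$(LC_ALL=C sort "$tmpdir/letters.txt" | uniq -c \
  | LC_ALL=C sort -k1,1nr -k2,2 | sed 's/^ *//')
echo "$counts"

# -d prints one copy of each repeated line; -u prints lines that occur once.
repeated=$(LC_ALL=C sort "$tmpdir/letters.txt" | uniq -d | paste -sd ' ' -)
only_once=$(LC_ALL=C sort "$tmpdir/letters.txt" | uniq -u | paste -sd ' ' -)
echo "$repeated"
echo "$only_once"

rm -r "$tmpdir"
```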
3.grep
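grep searches for lines that match a pattern; -c counts the matching lines (count); -v inverts the match and prints the lines that do not match.
head t.gtf # show the beginning of the file (the first 10 lines by default); -n sets how many lines to show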
#!genome-version TAIR10
#!genome-date 2008-04
#!genome-build-accession GCA_000001735.1
#!genebuild-last-updated 2010-09
1 araport11 gene 10942648 10944727 . - . gene_id "AT1G30814"; gene_source "araport11"; gene_biotype "protein_coding";
1 araport11 transcript 10942648 10944727 . - . gene_id "AT1G30814"; transcript_id "AT1G30814.1"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1 araport11 exon 10944317 10944727 . - . gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "1"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G30814.1.exon1";
1 araport11 exon 10944078 10944229 . - . gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "2"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G30814.1.exon2";
1 araport11 CDS 10944078 10944225 . - 0 gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "2"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G30814.1";
grep "CDS" t.gtf
1 araport11 CDS 10944078 10944225 . - 0 gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "2"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G30814.1";
1 araport11 CDS 10943868 10943984 . - 2 gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "3"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G30814.1";
# The matched "CDS" text is highlighted in color (when grep's color output is enabled)
grep -c "CDS" t.gtf
2
grep -v "^#" t.gtf | head -3 # skip comment lines (lines beginning with #)
1 araport11 gene 10942648 10944727 . - . gene_id "AT1G30814"; gene_source "araport11"; gene_biotype "protein_coding";
1 araport11 transcript 10942648 10944727 . - . gene_id "AT1G30814"; transcript_id "AT1G30814.1"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1 araport11 exon 10944317 10944727 . - . gene_id "AT1G30814"; transcript_id "AT1G30814.1"; exon_number "1"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G30814.1.exon1";
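grep, cut, sort and uniq can be chained to summarize the feature column of a GTF file. This is a sketch under stated assumptions: the file below is a cut-down, hypothetical stand-in for t.gtf with shortened attribute columns, written to a temporary directory.

```shell
tmpdir=$(mktemp -d)
{
  printf '#!genome-version TAIR10\n'
  printf '1\taraport11\tgene\t10942648\t10944727\t.\t-\t.\tgene_id "AT1G30814";\n'
  printf '1\taraport11\ttranscript\t10942648\t10944727\t.\t-\t.\tgene_id "AT1G30814";\n'
  printf '1\taraport11\texon\t10944317\t10944727\t.\t-\t.\tgene_id "AT1G30814";\n'
  printf '1\taraport11\texon\t10944078\t10944229\t.\t-\t.\tgene_id "AT1G30814";\n'
  printf '1\taraport11\tCDS\t10944078\t10944225\t.\t-\t0\tgene_id "AT1G30814";\n'
} > "$tmpdir/t.gtf"

# Number of header lines (lines beginning with #).
header_lines=$(grep -c '^#' "$tmpdir/t.gtf")
echo "$header_lines"

# Drop the header, take column 3 (the feature type), and count each type.
feature_counts=$(grep -v '^#' "$tmpdir/t.gtf" | cut -f 3 \
  | LC_ALL=C sort | uniq -c | sed 's/^ *//')
echo "$feature_counts"

rm -r "$tmpdir"
```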
cat b.txt
snpeff
snppppef
exper
database
data2
123
*123
grep "s" b.txt
snpeff
snppppef
database
grep "^s" b.txt # match lines that begin with s
snpeff
snppppef
grep "f$" b.txt # match lines that end with f
snpeff
snppppef
grep "snp*ef" b.txt # p* matches p zero or more times
snpeff
snppppef
grep -E "^s|d" b.txt # -E enables extended regular expressions; | means "or"
snpeff
snppppef
database
data2
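In "^s|d" the ^ anchor applies only to the s alternative, so any line containing d matches. The sketch below illustrates this with a copy of b.txt in a temporary directory, plus one extra line, index, which is not in the original file and is added only to make the difference visible. It also shows -F, which treats the pattern as a fixed string rather than a regular expression:

```shell
tmpdir=$(mktemp -d)
# Contents of b.txt, plus the hypothetical extra line "index".
printf '%s\n' snpeff snppppef exper database data2 123 '*123' index \
  > "$tmpdir/b.txt"

# ^ binds to s only: lines starting with s, or containing d anywhere.
s_start_or_any_d=$(grep -E '^s|d' "$tmpdir/b.txt" | paste -sd ' ' -)
echo "$s_start_or_any_d"

# Grouping applies ^ to both alternatives: lines starting with s or d.
s_or_d_start=$(grep -E '^(s|d)' "$tmpdir/b.txt" | paste -sd ' ' -)
echo "$s_or_d_start"

# Lines made up of digits only.
digits_only=$(grep -E '^[0-9]+$' "$tmpdir/b.txt")
echo "$digits_only"

# -F searches for a literal *, with no regular-expression meaning.
literal_star=$(grep -F '*' "$tmpdir/b.txt")
echo "$literal_star"

rm -r "$tmpdir"
```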