When learning Linux you usually encounter these compression tools: gzip, bzip2, zip, and xz, along with their matching decompression tools. For their usage and a comparison of compression ratios and compression times, see the earlier article on archiving and compression tools in Linux.
So what is pigz? In short, it is gzip with parallel compression. By default pigz uses one thread per logical CPU; if the CPU count cannot be detected, it falls back to 8 threads. You can also set the thread count explicitly with -p. Note that its CPU usage is correspondingly high.
Official site: http://zlib.net/pigz
### Installation
yum install pigz
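The yum command above targets RHEL/CentOS systems. The snippet below shows the Debian/Ubuntu equivalent (an assumption about your distribution) and how to confirm the install and the CPU count pigz will default to:

```shell
# Debian/Ubuntu equivalent (assumption about the target distribution):
#   sudo apt-get install pigz
# Verify the binary is on PATH and print its version:
command -v pigz >/dev/null && pigz --version
# Number of online processors -- this is pigz's default -p value:
nproc
```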
### Usage
$ pigz --help
Usage: pigz [options] [files ...]
will compress files in place, adding the suffix '.gz'. If no files are
specified, stdin will be compressed to stdout. pigz does what gzip does,
but spreads the work over multiple processors and cores when compressing.
Options:
-0 to -9, -11 Compression level (11 is much slower, a few % better)
--fast, --best Compression levels 1 and 9 respectively
-b, --blocksize mmm Set compression block size to mmmK (default 128K)
-c, --stdout Write all processed output to stdout (won't delete)
-d, --decompress Decompress the compressed input
-f, --force Force overwrite, compress .gz, links, and to terminal
-F --first Do iterations first, before block split for -11
-h, --help Display a help screen and quit
-i, --independent Compress blocks independently for damage recovery
-I, --iterations n Number of iterations for -11 optimization
-k, --keep Do not delete original file after processing
-K, --zip Compress to PKWare zip (.zip) single entry format
-l, --list List the contents of the compressed input
-L, --license Display the pigz license and quit
-M, --maxsplits n Maximum number of split blocks for -11
-n, --no-name Do not store or restore file name in/from header
-N, --name Store/restore file name and mod time in/from header
-O --oneblock Do not split into smaller blocks for -11
-p, --processes n Allow up to n compression threads (default is the
number of online processors, or 8 if unknown)
-q, --quiet Print no messages, even on error
-r, --recursive Process the contents of all subdirectories
-R, --rsyncable Input-determined block locations for rsync
-S, --suffix .sss Use suffix .sss instead of .gz (for compression)
-t, --test Test the integrity of the compressed input
-T, --no-time Do not store or restore mod time in/from header
-v, --verbose Provide more verbose output
-V --version Show the version of pigz
-z, --zlib Compress to zlib (.zz) instead of gzip format
-- All arguments after "--" are treated as files
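Before moving on to the benchmark, here is a minimal round trip using the options above (a sketch; it assumes pigz is on your PATH and skips itself otherwise):

```shell
# Skip gracefully if pigz is not installed (assumption: it may be absent).
command -v pigz >/dev/null || { echo "pigz not installed; skipping"; exit 0; }

echo "hello pigz" > demo.txt
pigz -k demo.txt        # compress to demo.txt.gz; -k keeps the original
pigz -t demo.txt.gz     # -t verifies integrity, exits non-zero on corruption
pigz -dc demo.txt.gz    # -d decompresses, -c writes to stdout
```

Because the output is standard gzip format, files produced by pigz can be decompressed with plain `gzip -d`, and vice versa.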
Size of the original directories:
[20:30 root@hulab /DataBase/Human/hg19]$ du -h
8.1G ./refgenome
1.4G ./encode_anno
4.2G ./hg19_index/hg19
8.1G ./hg19_index
18G .
Next, we compress the hg19_index directory with gzip and then with pigz at several thread counts, and compare the run times.
### Compressing with gzip (single thread)
[20:30 root@hulab /DataBase/Human/hg19]$ time tar -czvf index.tar.gz hg19_index/
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2
real 5m28.824s
user 5m3.866s
sys 0m35.314s
### Compressing with pigz using 4 threads
[20:36 root@hulab /DataBase/Human/hg19]$ ls
encode_anno hg19_index index.tar.gz refgenome
[20:38 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 4 > index_p4.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2
real 1m18.236s
user 5m22.578s
sys 0m35.933s
### Compressing with pigz using 8 threads
[20:42 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 8 > index_p8.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2
real 0m42.670s
user 5m48.527s
sys 0m28.240s
### Compressing with pigz using 16 threads
[20:43 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 16 > index_p16.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2
real 0m23.643s
user 6m24.054s
sys 0m24.923s
### Compressing with pigz using 32 threads
[20:43 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 32 > index_p32.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2
real 0m17.523s
user 7m27.479s
sys 0m29.283s
### Decompression
[21:00 root@hulab /DataBase/Human/hg19]$ time pigz -p 8 -d index_p8.tar.gz
real 0m27.717s
user 0m30.070s
sys 0m22.515s
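Instead of piping tar into pigz by hand, GNU tar can invoke pigz itself through `-I`/`--use-compress-program` (this sketch assumes GNU tar and skips itself if pigz is absent; `demo_dir` is a throwaway directory created here):

```shell
# Skip gracefully if pigz is not installed (assumption: it may be absent).
command -v pigz >/dev/null || { echo "pigz not installed; skipping"; exit 0; }

mkdir -p demo_dir && echo data > demo_dir/a.txt
tar -I pigz -cf demo.tar.gz demo_dir                  # compress through pigz
mkdir -p out && tar -I pigz -xf demo.tar.gz -C out    # extract through pigz
cat out/demo_dir/a.txt
```

Recent GNU tar versions also accept options inside the quoted string, e.g. `tar -I 'pigz -p 8' -cf ...`.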
Comparison of the compression times:
| Program | Threads | Time | Speedup vs gzip |
|---|---|---|---|
| gzip | 1 | 5m28.824s | 1.0x |
| pigz | 4 | 1m18.236s | 4.2x |
| pigz | 8 | 0m42.670s | 7.7x |
| pigz | 16 | 0m23.643s | 13.9x |
| pigz | 32 | 0m17.523s | 18.8x |
As the table shows, multithreaded pigz greatly shortens compression time: moving from single-threaded gzip to 4-thread pigz is already roughly a 4x speedup, while adding further threads yields diminishing returns.
Although pigz sharply reduces wall-clock time, it does so at the cost of CPU: the total CPU time consumed (the "user" figures above) actually grows with the thread count. On machines where CPU is already contended, avoid very high thread counts; 4 or 8 threads is usually a good compromise.
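If you do need a high thread count on a shared machine, lowering the scheduling priority with `nice` keeps pigz from starving other jobs (a sketch; `big.bin` is a throwaway test file created here):

```shell
# Skip gracefully if pigz is not installed (assumption: it may be absent).
command -v pigz >/dev/null || { echo "pigz not installed; skipping"; exit 0; }

head -c 1048576 /dev/urandom > big.bin     # 1 MiB of throwaway test data
nice -n 19 pigz -p 4 -k big.bin            # lowest priority, 4 threads, keep input
ls -l big.bin big.bin.gz
```

pigz still uses all four threads here, but at `nice -n 19` the kernel deprioritizes them whenever other processes want the CPU.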