Translated from the RocksDB Compaction wiki

URL: https://github.com/facebook/rocksdb/wiki/Compaction

Compaction

Compaction algorithms constrain the LSM tree shape. They determine which sorted runs can be merged together and which sorted runs need to be accessed for a read operation. You can read more on RocksDB compactions here: Multi-threaded compactions

LSM terminology and metaphors

Let us first establish the different, sometimes mixed, metaphors and terminology used in describing LSM levels and structure.

  • A level is above another level if its number is lower. For example, L1 is above L2
  • The lowest-numbered level, L0, can be called the top level or first level.
    • A version of a key in L0 must be newer than versions of that same key in all levels below L0.
    • Thus, L0 is sometimes loosely referred to as the level containing the newest data.
  • A level is below another level if its number is higher. For example, L2 is below L1.
    如果一個級別的數(shù)字高于另一個級別,則該級別低于另一個級別。例如,L2低于L1。
  • The highest-numbered level, Lmax, can be called the bottom-most or last level.
    • A version of a key in Lmax must be older than versions of that same key in all levels above Lmax.
    • Thus, Lmax is sometimes loosely referred to as the level containing the oldest data.
  • When talking about a particular key or key-range, a level is considered bottom-most when that level contains data for that key or key-range and no level below it contains data for it.

Overview of Compaction algorithms

Source: https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html

Here we present a taxonomy of compaction algorithms: Classic Leveled, Tiered, Tiered+Leveled, Leveled-N, FIFO. Out of them, RocksDB implements Tiered+Leveled (termed Level Compaction in the code), Tiered (termed Universal in the code), and FIFO.

Classic Leveled

Classic Leveled compaction, introduced in the LSM-tree paper by O'Neil et al., minimizes space amplification at the cost of read and write amplification.

The LSM tree is a sequence of levels. Each level is one sorted run that can be range partitioned into many files. Each level is many times larger than the previous level. The size ratio of adjacent levels is sometimes called the fanout and write amplification is minimized when the same fanout is used between all levels. Compaction into level N (Ln) merges data from Ln-1 into Ln. Compaction into Ln rewrites data that was previously merged into Ln. The per-level write amplification is equal to the fanout in the worst case, but it tends to be less than the fanout in practice as explained in this paper by Hyeontaek Lim et al. Compaction in the original LSM paper was all-to-all -- all data from Ln-1 is merged with all data from Ln. It is some-to-some for LevelDB and RocksDB -- some data from Ln-1 is merged with some (the overlapping) data in Ln.

While write amplification is usually worse with leveled than with tiered, there are a few cases where leveled is competitive. The first is key-order inserts and a RocksDB optimization greatly reduces write-amp in that case. The second one is skewed writes where only a small fraction of the keys are likely to be updated. With the right value for compaction priority in RocksDB compaction should stop at the smallest level that is large enough to capture the write working set -- it won't go all the way to the max level. When leveled compaction is some-to-some then compaction is only done for the slices of the LSM tree that overlap the written keys, which can generate less write amplification than all-to-all compaction.

Leveled-N

Leveled-N compaction is like leveled compaction but with less write and more read amplification. It allows more than one sorted run per level. Compaction merges all sorted runs from Ln-1 into one sorted run from Ln, which is leveled. And then "-N" is added to the name to indicate there can be n sorted runs per level. The Dostoevsky paper defined a compaction algorithm named Fluid LSM in which the max level has 1 sorted run but the non-max levels can have more than 1 sorted run. Leveled compaction is done into the max level.

Tiered

Tiered compaction minimizes write amplification at the cost of read and space amplification.

The LSM tree can still be viewed as a sequence of levels as explained in the Dostoevsky paper by Niv Dayan and Stratos Idreos. Each level has N sorted runs. Each sorted run in Ln is ~N times larger than a sorted run in Ln-1. Compaction merges all sorted runs in one level to create a new sorted run in the next level. N in this case is similar to fanout for leveled compaction. Compaction does not read/rewrite sorted runs in Ln when merging into Ln. The per-level write amplification is 1 which is much less than for leveled where it was fanout.

A common approach for tiered is to merge sorted runs of similar size, without having the notion of levels (which imply a target for the number of sorted runs of specific sizes). Most include some notion of major compaction that includes the largest sorted run and conditions that trigger major and non-major compaction. Too many files and too many bytes are typical conditions.

There are a few challenges with tiered compaction:

  • Transient space amplification is large when compaction includes a sorted run from the max level.
  • The block index and bloom filter for large sorted runs will be large. Splitting them into smaller parts is a good idea.
  • Compaction for large sorted runs takes a long time. Multi-threading would help.
  • Compaction is all-to-all. When there is skew and most of the keys don't get updates, large sorted runs might get rewritten because compaction is all-to-all. In a traditional tiered algorithm there is no way to rewrite a subset of a large sorted run.

For tiered compaction the notion of levels is usually a concept used to reason about the shape of the LSM tree and to estimate write amplification. With RocksDB they are also an implementation detail. The levels of the LSM tree beyond L0 can be used to store the larger sorted runs. The benefit from this is to partition large sorted runs into smaller SSTs. This reduces the size of the largest bloom filter and block index chunks -- which is friendlier to the block cache -- and was a big deal before partitioned index/filter was supported. With subcompactions this enables multi-threaded compaction of the largest sorted runs. Note that RocksDB uses the name universal rather than tiered.

Tiered compaction in the RocksDB code base is termed Universal Compaction.

Tiered+Leveled

Tiered+Leveled has less write amplification than leveled and less space amplification than tiered.

The tiered+leveled approach is a hybrid that uses tiered for the smaller levels and leveled for the larger levels. It is flexible about the level at which the LSM tree switches from tiered to leveled. For now I assume that if Ln is leveled then all levels that follow (Ln+1, Ln+2, ...) must be leveled.

SlimDB from VLDB 2018 is an example of tiered+leveled although it might allow Lk to be tiered when Ln is leveled for k > n. Fluid LSM is described as tiered+leveled but I think it is leveled-N.

Leveled compaction in RocksDB is also tiered+leveled. There can be N sorted runs at the memtable level courtesy of the max_write_buffer_number option -- only one is active for writes, the rest are read-only waiting to be flushed. A memtable flush is similar to tiered compaction -- the memtable output creates a new sorted run in L0 and doesn't read/rewrite existing sorted runs in L0. There can be N sorted runs in level 0 (L0) courtesy of level0_file_num_compaction_trigger. So the L0 is tiered. Compaction isn't done into the memtable level so it doesn't have to be labeled as tiered or leveled. Subcompactions in the RocksDB L0 makes this even more interesting, but that is a topic for another post.
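The tiered top of the tree described above shows up directly in the options. The following is a minimal sketch assuming the RocksDB headers are available; the values are illustrative, not tuning recommendations:

```cpp
#include <rocksdb/options.h>

// Sketch: leveled compaction with a tiered top. Several immutable
// memtables may queue for flush, and several sorted runs may pile up
// in L0 before a compaction into L1 is triggered.
rocksdb::Options MakeLeveledOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleLevel;
  options.max_write_buffer_number = 4;             // memtable-level sorted runs
  options.level0_file_num_compaction_trigger = 4;  // L0 sorted runs
  return options;
}
```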

FIFO

FIFO style compaction drops the oldest file when it becomes obsolete and can be used for cache-like data.

Options

Here we give an overview of the options that impact the behavior of compactions:

AdvancedColumnFamilyOptions::compaction_style - RocksDB currently supports four compaction algorithms - kCompactionStyleLevel (default), kCompactionStyleUniversal, kCompactionStyleFIFO, and kCompactionStyleNone. If kCompactionStyleNone is selected, compaction has to be triggered manually by calling CompactRange() or CompactFiles(). Level compaction options are available under AdvancedColumnFamilyOptions. Universal compaction options are available in AdvancedColumnFamilyOptions::compaction_options_universal, and FIFO compaction options are available in AdvancedColumnFamilyOptions::compaction_options_fifo.
ColumnFamilyOptions::disable_auto_compactions - This dynamically changeable setting can be used by the application to disable automatic compactions. Manual compactions can still be issued on this database.
ColumnFamilyOptions::compaction_filter - Allows an application to modify/delete a key-value during background compaction (single instance). The client must provide compaction_filter_factory if it requires a new compaction filter for different compaction runs. The client should specify only one of filter or factory.
ColumnFamilyOptions::compaction_filter_factory - A factory that provides compaction filter objects, allowing an application to modify/delete a key-value during background compaction. A new filter will be created for each compaction run.
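As an illustration of the filter interface, here is a minimal sketch assuming the RocksDB headers are available; the "tmp_" prefix policy is hypothetical, invented for this example:

```cpp
#include <rocksdb/compaction_filter.h>

// Sketch: a compaction filter that drops keys with a hypothetical
// "tmp_" prefix as they are rewritten by background compaction.
class DropTempKeysFilter : public rocksdb::CompactionFilter {
 public:
  bool Filter(int /*level*/, const rocksdb::Slice& key,
              const rocksdb::Slice& /*existing_value*/,
              std::string* /*new_value*/,
              bool* /*value_changed*/) const override {
    // Returning true removes the key-value from the compaction output.
    return key.starts_with("tmp_");
  }
  const char* Name() const override { return "DropTempKeysFilter"; }
};
```

A single long-lived instance would be registered via ColumnFamilyOptions::compaction_filter; to get a fresh filter per compaction run, implement a CompactionFilterFactory instead.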
Other options impacting performance of compactions and when they get triggered are:

DBOptions::access_hint_on_compaction_start (Default: NORMAL) - Specifies the file access pattern once a compaction is started. It will be applied to all input files of a compaction. Other AccessHint settings: NONE, SEQUENTIAL, WILLNEED.
ColumnFamilyOptions::level0_file_num_compaction_trigger (Default: 4) - Number of files that triggers level-0 compaction. A negative value means that level-0 compaction will not be triggered by the number of files at all.
AdvancedColumnFamilyOptions::target_file_size_base and AdvancedColumnFamilyOptions::target_file_size_multiplier - Target file size for compaction. target_file_size_base is the per-file size for level-1. The target file size for level L can be calculated as target_file_size_base * (target_file_size_multiplier ^ (L-1)). For example, if target_file_size_base is 2MB and target_file_size_multiplier is 10, then each file on level-1 will be 2MB, each file on level-2 will be 20MB, and each file on level-3 will be 200MB. The default target_file_size_base is 64MB and the default target_file_size_multiplier is 1.
AdvancedColumnFamilyOptions::max_compaction_bytes (Default: target_file_size_base * 25) - Maximum number of bytes in all compacted files. We avoid expanding the lower-level file set of a compaction if it would make the total compaction cover more than this amount.
DBOptions::max_background_jobs (Default: 2) - Maximum number of concurrent background jobs (compactions and flushes).
DBOptions::compaction_readahead_size - If non-zero, we perform bigger reads when doing compaction. If you're running RocksDB on spinning disks, you should set this to at least 2MB. With direct I/O, it is enforced to be at least 2MB if you don't set it.
Compaction can also be manually triggered. See Manual Compaction

See include/rocksdb/options.h and include/rocksdb/advanced_options.h for a detailed explanation of these options.

Leveled style compaction

See Leveled Compaction.

Universal style compaction

For a description of universal style compaction, see Universal compaction style.

If you're using Universal style compaction, there is an object CompactionOptionsUniversal that holds all the different options for that compaction. The exact definition is in rocksdb/universal_compaction.h and you can set it in Options::compaction_options_universal. Here we give a short overview of options in CompactionOptionsUniversal:

CompactionOptionsUniversal::size_ratio - Percentage flexibility while comparing file size. If the candidate file(s) size is 1% smaller than the next file's size, then include next file into this candidate set. Default: 1
CompactionOptionsUniversal::min_merge_width - The minimum number of files in a single compaction run. Default: 2
CompactionOptionsUniversal::max_merge_width - The maximum number of files in a single compaction run. Default: UINT_MAX
CompactionOptionsUniversal::max_size_amplification_percent - The size amplification is defined as the amount (in percentage) of additional storage needed to store a single byte of data in the database. For example, a size amplification of 2% means that a database containing 100 bytes of user data may occupy up to 102 bytes of physical storage. By this definition, a fully compacted database has a size amplification of 0%. RocksDB uses the following heuristic to calculate size amplification: it assumes that all files excluding the earliest file contribute to the size amplification. Default: 200, which means that a 100 byte database could require up to 300 bytes of storage.
CompactionOptionsUniversal::compression_size_percent - If this option is set to -1 (the default value), all the output files will follow the compression type specified. If this option is not negative, we will try to make sure the compressed size is just above this value. In normal cases, at least this percentage of data will be compressed. When we are compacting to a new file, the criterion for whether it needs to be compressed is the following: assuming the list of files sorted by generation time is [ A1...An B1...Bm C1...Ct ], where A1 is the newest and Ct is the oldest, and we are going to compact B1...Bm, we calculate the total size of all the files as total_size, as well as the total size of C1...Ct as total_C; the compaction output file will be compressed iff total_C / total_size < this percentage.
CompactionOptionsUniversal::stop_style - The algorithm used to stop picking files into a single compaction run. Can be kCompactionStopStyleSimilarSize (pick files of similar size) or kCompactionStopStyleTotalSize (total size of picked files > next file). Default: kCompactionStopStyleTotalSize
CompactionOptionsUniversal::allow_trivial_move - Option to optimize the universal multi level compaction by enabling trivial move for non overlapping files. Default: false.
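A minimal sketch of wiring these options together, assuming the RocksDB headers are available; the values mirror the defaults listed above:

```cpp
#include <rocksdb/options.h>
#include <rocksdb/universal_compaction.h>

// Sketch: enabling universal (tiered) compaction with a few of the
// options described above.
rocksdb::Options MakeUniversalOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleUniversal;
  options.compaction_options_universal.size_ratio = 1;
  options.compaction_options_universal.min_merge_width = 2;
  options.compaction_options_universal.max_size_amplification_percent = 200;
  options.compaction_options_universal.stop_style =
      rocksdb::kCompactionStopStyleTotalSize;
  return options;
}
```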

FIFO Compaction Style

See FIFO compaction style

Thread pools

Compactions are executed in thread pools. See Thread Pool.
