Original URL: https://github.com/facebook/rocksdb/wiki/Block-Cache
Block cache is where RocksDB caches data in memory for reads. Users can pass a Cache object with the desired capacity (size) to a RocksDB instance. A Cache object can be shared by multiple RocksDB instances in the same process, which lets users control the overall cache capacity. The block cache stores uncompressed blocks. Optionally, users can set a second block cache that stores compressed blocks. Reads fetch data blocks first from the uncompressed block cache, then from the compressed block cache. The compressed block cache can replace the OS page cache if [[Direct-IO]] is used.
There are two cache implementations in RocksDB: LRUCache and ClockCache. Both are sharded to mitigate lock contention. Capacity is divided evenly among the shards, and shards do not share capacity. By default, each cache is sharded into at most 64 shards, with each shard having no less than 512KB of capacity.
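The default sharding rule above (at most 64 shards, at least 512KB per shard) can be sketched with plain arithmetic. The helper names below are hypothetical and are not part of the RocksDB API; the exact rounding RocksDB applies when dividing capacity is an assumption here.

```cpp
#include <cstddef>

// Hypothetical constants taken from the text: at most 2^6 = 64 shards,
// each holding at least 512KB.
constexpr std::size_t kMinShardSize = 512 * 1024;
constexpr int kMaxShardBits = 6;

// Picks the largest number of shard bits that keeps every shard at or
// above kMinShardSize, capped at kMaxShardBits.
int DefaultShardBits(std::size_t capacity) {
  int bits = 0;
  std::size_t num_shards = capacity / kMinShardSize;
  // Every time num_shards can be halved and stay >= 1, one more bit fits.
  while (num_shards >>= 1) {
    if (++bits >= kMaxShardBits) break;
  }
  return bits;
}

// Capacity is split evenly; a shard never borrows space from another shard.
// Rounding down is an assumption made for this sketch.
std::size_t PerShardCapacity(std::size_t capacity, int shard_bits) {
  return capacity >> shard_bits;
}
```

Under these assumptions, an 8MB cache ends up with 4 shard bits, that is 16 shards of 512KB each, while a 1GB cache hits the 64-shard cap.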
Usage
Out of the box, RocksDB uses an LRU-based block cache implementation with 8MB capacity. To set a customized block cache, call NewLRUCache() or NewClockCache() to create a cache object, and set it in the block-based table options. Users can also provide their own cache implementation by implementing the Cache interface.
std::shared_ptr<Cache> cache = NewLRUCache(capacity);
BlockBasedTableOptions table_options;
table_options.block_cache = cache;
Options options;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
To set a compressed block cache:
table_options.block_cache_compressed = another_cache;
RocksDB will create the default block cache if block_cache is set to nullptr. To disable the block cache completely:
table_options.no_block_cache = true;
LRU Cache
Out of the box, RocksDB uses an LRU-based block cache implementation with 8MB capacity. Each shard of the cache maintains its own LRU list and its own hash table for lookup. Synchronization is done via a per-shard mutex. Both lookups and inserts require locking the shard's mutex. Users can create an LRU cache by calling NewLRUCache(). The function provides several useful options for the cache:
- capacity: Total size of the cache.
- num_shard_bits: The number of bits from cache keys to be used as the shard id. The cache will be sharded into 2^num_shard_bits shards.
- strict_capacity_limit: In rare cases, the block cache size can grow larger than its capacity. This happens when ongoing reads or iterations over the DB pin blocks in the block cache, and the total size of pinned blocks exceeds the capacity. If further reads try to insert blocks into the block cache and strict_capacity_limit=false (default), the cache will not respect its capacity limit and will allow the insertion. This can cause an undesired OOM error that crashes the DB if the host doesn't have enough memory. Setting the option to true will reject further insertions into the cache and fail the read or iteration. The option works on a per-shard basis, which means one shard may reject inserts when it is full while another shard still has extra unpinned space.
- high_pri_pool_ratio: The ratio of capacity reserved for high priority blocks. See the [[Caching Index, Filter, and Compression Dictionary Blocks|Block-Cache#caching-index-filter-and-compression-dictionary-blocks]] section below for more information.
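The effect of strict_capacity_limit described above can be illustrated with a toy model of a single shard. This is a sketch under simplifying assumptions, not RocksDB's implementation: ToyShard and its members are hypothetical names, and blocks are reduced to byte counts.

```cpp
#include <cstddef>

// Toy model of one cache shard. Pinned bytes are held by ongoing reads or
// iterators and cannot be evicted; unpinned bytes can.
struct ToyShard {
  std::size_t capacity;
  bool strict_capacity_limit;
  std::size_t pinned_bytes = 0;
  std::size_t unpinned_bytes = 0;

  ToyShard(std::size_t cap, bool strict)
      : capacity(cap), strict_capacity_limit(strict) {}

  std::size_t usage() const { return pinned_bytes + unpinned_bytes; }

  // Tries to insert a block of `charge` bytes that the caller keeps pinned.
  // Returns false only when the strict limit rejects the insertion.
  bool InsertPinned(std::size_t charge) {
    // Evict unpinned bytes first, as far as needed and as far as possible.
    std::size_t over =
        usage() + charge > capacity ? usage() + charge - capacity : 0;
    std::size_t evict = over < unpinned_bytes ? over : unpinned_bytes;
    unpinned_bytes -= evict;
    if (strict_capacity_limit && usage() + charge > capacity) {
      return false;  // the read or iteration that needed the block fails
    }
    pinned_bytes += charge;  // may exceed capacity when the limit is off
    return true;
  }
};
```

With a capacity of 100 and two pinned inserts of 60 bytes each, the strict shard rejects the second insert, while the non-strict shard accepts it and ends up at 120 bytes of usage. Because each shard keeps its own counters, a full shard can reject inserts even when a sibling shard has room.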
Clock Cache
WARNING: The ClockCache implementation has at least one remaining bug that could lead to crash or data corruption. Please do not use ClockCache until this is fixed.
ClockCache implements the CLOCK algorithm. Each shard of the clock cache maintains a circular list of cache entries. A clock handle runs over the circular list looking for unpinned entries to evict, while giving each entry a second chance to stay in the cache if it has been used since the last scan. A tbb::concurrent_hash_map is used for lookup.
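The second-chance scan can be sketched as a small single-threaded model. This is only an illustration of the CLOCK idea: ToyClockShard is a hypothetical name, entries are counted in slots rather than bytes, lookup is a linear search instead of a concurrent hash map, and duplicate keys are not handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy, single-threaded CLOCK shard (a sketch, not the real ClockCache).
class ToyClockShard {
 public:
  explicit ToyClockShard(std::size_t slots) : slots_(slots) {}

  // On a hit, set the reference bit so the next scan spares the entry once.
  bool Lookup(const std::string& key) {
    Entry* e = Find(key);
    if (e == nullptr) return false;
    e->referenced = true;
    return true;
  }

  // Pinned entries are skipped by the clock hand.
  bool Pin(const std::string& key) {
    Entry* e = Find(key);
    if (e == nullptr) return false;
    ++e->pins;
    return true;
  }

  // Inserts `key`. When the shard is full, the clock hand picks a victim and
  // writes its key to `*evicted`. Returns false if every entry is pinned.
  bool Insert(const std::string& key, std::string* evicted) {
    evicted->clear();
    if (entries_.size() < slots_) {
      entries_.push_back(Entry{key, false, 0});
      return true;
    }
    // Two sweeps suffice: the first clears reference bits, the second finds
    // any entry that is unpinned.
    for (std::size_t step = 0; step < 2 * slots_; ++step) {
      Entry& e = entries_[hand_];
      hand_ = (hand_ + 1) % slots_;
      if (e.pins > 0) continue;  // in use, cannot be evicted
      if (e.referenced) {        // used since the last scan: second chance
        e.referenced = false;
        continue;
      }
      *evicted = e.key;
      e = Entry{key, false, 0};
      return true;
    }
    return false;
  }

 private:
  struct Entry {
    std::string key;
    bool referenced;
    int pins;
  };

  Entry* Find(const std::string& key) {
    for (Entry& e : entries_) {
      if (e.key == key) return &e;
    }
    return nullptr;
  }

  std::size_t slots_;
  std::size_t hand_ = 0;
  std::vector<Entry> entries_;
};
```

In a three-slot shard holding a, b and c, looking up a and then inserting d evicts b: the hand clears the reference bit on a, moves on, and finds b unreferenced and unpinned.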
The benefit of ClockCache over LRUCache is finer-grained locking. With the LRU cache, the per-shard mutex has to be locked even on lookup, since a lookup needs to update the LRU list. Looking up from a clock cache does not require locking the per-shard mutex; it only looks up the concurrent hash map, which has fine-grained locking. Only inserts need to lock the per-shard mutex. With the clock cache we see a boost in read throughput over the LRU cache under contention (see inline comments in cache/clock_cache.cc for benchmark setup):
Threads  Cache Size  Index/Filter  ClockCache                   LRUCache
                                   Throughput(MB/s)  Hit        Throughput(MB/s)  Hit
32       2GB         yes           466.7             85.9%      433.7             86.5%
32       2GB         no            529.9             72.7%      532.7             73.9%
32       64GB        yes           649.9             99.9%      507.9             99.9%
32       64GB        no            740.4             99.9%      662.8             99.9%
16       2GB         yes           278.4             85.9%      283.4             86.5%
16       2GB         no            318.6             72.7%      335.8             73.9%
16       64GB        yes           391.9             99.9%      353.3             99.9%
16       64GB        no            433.8             99.8%      419.4             99.8%
To create a clock cache, call NewClockCache(). To make the clock cache available, RocksDB needs to be linked with the Intel TBB library. Again, there are several options users can set when creating a clock cache:
- capacity: Same as LRUCache.
- num_shard_bits: Same as LRUCache.
- strict_capacity_limit: Same as LRUCache.
Caching Index, Filter, and Compression Dictionary Blocks
By default, index, filter, and compression dictionary blocks (with the exception of the partitions of partitioned indexes/filters) are cached outside of the block cache, and users cannot control how much memory is used to cache these blocks, other than by setting max_open_files. Users can opt to cache index and filter blocks in the block cache, which allows for better control of the memory used by RocksDB. To cache index, filter, and compression dictionary blocks in the block cache:
BlockBasedTableOptions table_options;
table_options.cache_index_and_filter_blocks = true;
Note that the partitions of partitioned indexes/filters are as a rule stored in the block cache, regardless of the value of the above option.
When index, filter, and compression dictionary blocks are put in the block cache, they have to compete against data blocks to stay in the cache. Although index and filter blocks are accessed more frequently than data blocks, there are scenarios where these blocks can thrash. This is undesirable because index and filter blocks tend to be much larger than data blocks, and it is usually more valuable to keep them in the cache (the latter is also true for compression dictionary blocks). The following options can be tuned to mitigate the problem:
- cache_index_and_filter_blocks_with_high_priority: Set priority to high for index, filter, and compression dictionary blocks in the block cache. For partitioned indexes/filters, this affects the priority of the partitions as well. So far it only affects LRUCache, and it needs to be used together with high_pri_pool_ratio when calling NewLRUCache(). If the feature is enabled, the LRU list in the LRU cache is split into two parts, one for high-pri blocks and one for low-pri blocks. Data blocks are inserted at the head of the low-pri pool. Index, filter, and compression dictionary blocks are inserted at the head of the high-pri pool. If the total usage in the high-pri pool exceeds capacity * high_pri_pool_ratio, the block at the tail of the high-pri pool overflows to the head of the low-pri pool, after which it competes against data blocks to stay in the cache. Eviction starts from the tail of the low-pri pool.
- pin_l0_filter_and_index_blocks_in_cache: Pin the index and filter blocks of level-0 files in the block cache to prevent them from being evicted. Starting with RocksDB version 6.4, this option also affects compression dictionary blocks. Level-0 indexes and filters are typically accessed more frequently. They also tend to be smaller, so pinning them in the cache should not consume too much capacity.
- pin_top_level_index_and_filter: Only applicable to partitioned indexes/filters. If true, the top level of the partitioned index/filter structure will be pinned in the cache, regardless of the LSM tree level (that is, unlike the previous option, this affects files on all LSM tree levels, not just L0).
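The two-pool LRU list behind cache_index_and_filter_blocks_with_high_priority can be sketched as follows. This is a toy model, not RocksDB's LRUCache: ToyPriorityLru is a hypothetical name, sizes are counted in blocks instead of bytes, and promotion on lookup is left out to keep the overflow and eviction order visible.

```cpp
#include <cstddef>
#include <list>
#include <string>

// Toy LRU list split into a high-pri pool and a low-pri pool.
class ToyPriorityLru {
 public:
  ToyPriorityLru(std::size_t capacity, double high_pri_pool_ratio)
      : capacity_(capacity),
        high_capacity_(
            static_cast<std::size_t>(capacity * high_pri_pool_ratio)) {}

  // Inserts a block and returns the key evicted to make room, or an empty
  // string if nothing had to be evicted.
  std::string Insert(const std::string& key, bool high_priority) {
    if (high_priority) {
      high_.push_front(key);
      // Overflow: the tail of the high-pri pool moves to the head of the
      // low-pri pool, where it competes with data blocks.
      while (high_.size() > high_capacity_) {
        low_.push_front(high_.back());
        high_.pop_back();
      }
    } else {
      low_.push_front(key);
    }
    std::string evicted;
    // Eviction always starts from the tail of the low-pri pool.
    if (high_.size() + low_.size() > capacity_ && !low_.empty()) {
      evicted = low_.back();
      low_.pop_back();
    }
    return evicted;
  }

  std::size_t high_size() const { return high_.size(); }
  std::size_t low_size() const { return low_.size(); }

 private:
  std::size_t capacity_;
  std::size_t high_capacity_;
  std::list<std::string> high_;  // front = most recently inserted
  std::list<std::string> low_;
};
```

With a capacity of 4 blocks and a ratio of 0.5, inserting a third high-priority block pushes the oldest high-priority block into the low-pri pool instead of evicting it, and the oldest data block is evicted first.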
Simulated Cache
SimCache is a utility to predict the cache hit rate if the cache capacity or number of shards is changed. It wraps around the real Cache object that the DB is using, runs a shadow LRU cache simulating the given capacity and number of shards, and measures cache hits and misses of the shadow cache. The utility is useful when a user wants to open a DB with, say, a 4GB cache, but would like to know what the cache hit rate would become if the cache size were enlarged to, say, 64GB. To create a simulated cache:
// This cache is the actual cache used by the DB.
std::shared_ptr<Cache> cache = NewLRUCache(capacity);
// This is the simulated cache.
std::shared_ptr<Cache> sim_cache = NewSimCache(cache, sim_capacity, sim_num_shard_bits);
BlockBasedTableOptions table_options;
table_options.block_cache = sim_cache;
The extra memory overhead of the simulated cache is less than 2% of sim_capacity.
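The idea of a shadow cache that tracks residency without storing block contents can be sketched with a key-only LRU. This is an illustration, not how SimCache is implemented: KeyOnlyLru and SimulatedHitRate are hypothetical names, capacity is counted in keys rather than bytes, and the sketch replays a recorded trace, whereas SimCache observes live traffic.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// Key-only LRU: remembers which keys would be resident, stores no values.
class KeyOnlyLru {
 public:
  explicit KeyOnlyLru(std::size_t max_keys) : max_keys_(max_keys) {}

  // Records one access and returns true if it would have been a hit.
  bool Access(const std::string& key) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      // Move the key to the most-recently-used position.
      order_.splice(order_.begin(), order_, it->second);
      ++hits_;
      return true;
    }
    ++misses_;
    if (max_keys_ == 0) return false;
    if (order_.size() == max_keys_) {
      index_.erase(order_.back());  // evict the least recently used key
      order_.pop_back();
    }
    order_.push_front(key);
    index_[key] = order_.begin();
    return false;
  }

  double HitRate() const {
    const std::size_t total = hits_ + misses_;
    return total == 0 ? 0.0 : static_cast<double>(hits_) / total;
  }

 private:
  std::size_t max_keys_;
  std::size_t hits_ = 0;
  std::size_t misses_ = 0;
  std::list<std::string> order_;
  std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};

// Replays one access trace against a hypothetical cache size.
double SimulatedHitRate(const std::vector<std::string>& trace,
                        std::size_t max_keys) {
  KeyOnlyLru shadow(max_keys);
  for (const std::string& key : trace) shadow.Access(key);
  return shadow.HitRate();
}
```

Replaying the trace a, b, c, a, b, c gives a hit rate of 0 with room for two keys and 0.5 with room for three, which is the kind of "what if the cache were larger" answer the simulated cache is meant to provide.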
Statistics
A list of block cache counters can be accessed through Options.statistics if it is non-null.
// total block cache misses
// REQUIRES: BLOCK_CACHE_MISS == BLOCK_CACHE_INDEX_MISS +
// BLOCK_CACHE_FILTER_MISS +
// BLOCK_CACHE_DATA_MISS;
BLOCK_CACHE_MISS = 0,
// total block cache hit
// REQUIRES: BLOCK_CACHE_HIT == BLOCK_CACHE_INDEX_HIT +
// BLOCK_CACHE_FILTER_HIT +
// BLOCK_CACHE_DATA_HIT;
BLOCK_CACHE_HIT,
// # of blocks added to block cache.
BLOCK_CACHE_ADD,
// # of failures when adding blocks to block cache.
BLOCK_CACHE_ADD_FAILURES,
// # of times cache miss when accessing index block from block cache.
BLOCK_CACHE_INDEX_MISS,
// # of times cache hit when accessing index block from block cache.
BLOCK_CACHE_INDEX_HIT,
// # of times cache miss when accessing filter block from block cache.
BLOCK_CACHE_FILTER_MISS,
// # of times cache hit when accessing filter block from block cache.
BLOCK_CACHE_FILTER_HIT,
// # of times cache miss when accessing data block from block cache.
BLOCK_CACHE_DATA_MISS,
// # of times cache hit when accessing data block from block cache.
BLOCK_CACHE_DATA_HIT,
// # of bytes read from cache.
BLOCK_CACHE_BYTES_READ,
// # of bytes written into cache.
BLOCK_CACHE_BYTES_WRITE,
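A common use of these counters is to derive a hit ratio from BLOCK_CACHE_HIT and BLOCK_CACHE_MISS. The helper below is a hypothetical sketch that takes the two counter values as plain integers; in an application they would typically be read from the statistics object, for example with getTickerCount().

```cpp
#include <cstdint>

// Fraction of block cache lookups that were hits. Returns 0.0 when no
// lookups have been recorded, to avoid dividing by zero.
double BlockCacheHitRatio(std::uint64_t hits, std::uint64_t misses) {
  const std::uint64_t lookups = hits + misses;
  if (lookups == 0) return 0.0;
  return static_cast<double>(hits) / static_cast<double>(lookups);
}
```

The same function applies to the per-type counters, for instance BLOCK_CACHE_INDEX_HIT and BLOCK_CACHE_INDEX_MISS, to see whether index blocks are being evicted too often.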
See also: [[Memory-usage-in-RocksDB#block-cache]]