RocksDB Overview

Source: https://github.com/facebook/rocksdb/wiki/RocksDB-Overview
Wiki repository: https://github.com/facebook/rocksdb.wiki.git

1. Introduction

RocksDB started at Facebook as a storage engine for server workloads on various storage media, with the initial focus on fast storage (especially Flash storage). It is a C++ library to store keys and values, which are arbitrarily-sized byte streams. It supports both point lookups and range scans, and provides different types of ACID guarantees.

A balance is struck between customizability and self-adaptability. RocksDB features highly flexible configuration settings that may be tuned to run on a variety of production environments, including SSDs, hard disks, ramfs, or remote storage. It supports various compression algorithms and good tools for production support and debugging. On the other hand, efforts are also made to limit the number of knobs, to provide good enough out-of-box performance, and to use some adaptive algorithms wherever applicable.

RocksDB borrows significant code from the open source leveldb project as well as ideas from Apache HBase. The initial code was forked from open source leveldb 1.5. It also builds upon code and ideas that were developed at Facebook before RocksDB.

2. Assumptions and Goals

Performance:

The primary design point for RocksDB is that it should be performant for fast storage and for server workloads. It should support efficient point lookups as well as range scans. It should be configurable to support high random-read workloads, high update workloads or a combination of both. Its architecture should support easy tuning of trade-offs for different workloads and hardware.

Production Support:

RocksDB should be designed in such a way that it has built-in support for tools and utilities that help deployment and debugging in production environments. If the storage engine cannot yet automatically adapt itself to the application and hardware, we will provide some parameters to allow users to tune performance.

Compatibility:

Newer versions of this software should be backward compatible, so that existing applications do not need to change when upgrading to newer releases of RocksDB. Unless using newly provided features, existing applications also should be able to revert to a recent old release. See RocksDB Compatibility Between Different Releases.

3. High Level Architecture

RocksDB is a storage engine library of key-value store interface where keys and values are arbitrary byte streams. RocksDB organizes all data in sorted order and the common operations are Get(key), NewIterator(), Put(key, val), Delete(key), and SingleDelete(key).
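
A minimal C++ sketch of these operations (the path, keys and values are illustrative, not from the original text):

    #include <cassert>
    #include <string>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;  // create the database if it does not exist
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_overview_demo", &db);
      assert(s.ok());

      s = db->Put(rocksdb::WriteOptions(), "key1", "value1");   // point write
      std::string value;
      s = db->Get(rocksdb::ReadOptions(), "key1", &value);      // point lookup
      assert(s.ok() && value == "value1");
      s = db->Delete(rocksdb::WriteOptions(), "key1");          // point delete

      delete db;  // close the database
      return 0;
    }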

The three basic constructs of RocksDB are the memtable, sstfile and logfile. The memtable is an in-memory data structure - new writes are inserted into the memtable and are optionally written to the logfile (a.k.a. the Write Ahead Log (WAL)). The logfile is a sequentially-written file on storage. When the memtable fills up, it is flushed to an sstfile on storage and the corresponding logfile can be safely deleted. The data in an sstfile is sorted to facilitate easy lookup of keys.

The default format of sstfile is described in more detail here.

4. Features

Column Families

RocksDB supports partitioning a database instance into multiple column families. All databases are created with a column family named "default", which is used for operations where column family is unspecified.

RocksDB guarantees users a consistent view across column families, including after crash recovery when WAL is enabled or atomic flush is enabled. It also supports atomic cross-column family operations via the WriteBatch API.
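
A sketch of creating an extra column family and writing to two column families atomically via WriteBatch (the "users" family and keys are illustrative):

    #include <cassert>
    #include "rocksdb/db.h"
    #include "rocksdb/write_batch.h"

    int main() {
      rocksdb::DB* db;
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_cf_demo", &db);
      assert(s.ok());

      // Create a new column family; "default" always exists.
      rocksdb::ColumnFamilyHandle* users_cf;
      s = db->CreateColumnFamily(rocksdb::ColumnFamilyOptions(), "users", &users_cf);
      assert(s.ok());

      // Atomic write that spans two column families.
      rocksdb::WriteBatch batch;
      batch.Put(db->DefaultColumnFamily(), "counter", "1");
      batch.Put(users_cf, "user:42", "alice");
      s = db->Write(rocksdb::WriteOptions(), &batch);
      assert(s.ok());

      db->DestroyColumnFamilyHandle(users_cf);
      delete db;
      return 0;
    }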

Updates

A Put API inserts a single key-value to the database. If the key already exists in the database, the previous value will be overwritten. A Write API allows multiple keys-values to be atomically inserted, updated, or deleted in the database. The database guarantees that either all of the keys-values in a single Write call will be inserted into the database or none of them will be inserted into the database. If any of those keys already exist in the database, previous values will be overwritten. DeleteRange API can be used to delete all keys from a range.
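
A sketch of an atomic multi-key Write followed by a DeleteRange, assuming an already-opened db handle as in the earlier example (keys are illustrative):

    #include <cassert>
    #include "rocksdb/db.h"
    #include "rocksdb/write_batch.h"

    // Apply several updates atomically, then drop a whole key range.
    void AtomicUpdateExample(rocksdb::DB* db) {
      rocksdb::WriteBatch batch;
      batch.Put("user:1", "alice");
      batch.Put("user:2", "bob");
      batch.Delete("user:0");
      rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);  // all or nothing
      assert(s.ok());

      // Delete every key in ["user:1", "user:9") from the default column family.
      s = db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                          "user:1", "user:9");
      assert(s.ok());
    }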

Gets, Iterators and Snapshots

Keys and values are treated as pure byte streams. There is no limit to the size of a key or a value. The Get API allows an application to fetch a single key-value from the database. The MultiGet API allows an application to retrieve a bunch of keys from the database. All the keys-values returned via a MultiGet call are consistent with one-another.
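
A small sketch of Get and MultiGet against an open db handle (keys are illustrative):

    #include <string>
    #include <vector>
    #include "rocksdb/db.h"

    // Fetch one key, then a batch of keys whose values are consistent with each other.
    void LookupExample(rocksdb::DB* db) {
      std::string value;
      rocksdb::Status s = db->Get(rocksdb::ReadOptions(), "user:1", &value);
      if (s.IsNotFound()) { /* key does not exist */ }

      std::vector<rocksdb::Slice> keys = {"user:1", "user:2", "user:3"};
      std::vector<std::string> values;
      std::vector<rocksdb::Status> statuses =
          db->MultiGet(rocksdb::ReadOptions(), keys, &values);
      // statuses[i] and values[i] correspond to keys[i].
    }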

All data in the database is logically arranged in sorted order. An application can specify a key comparison method that specifies a total ordering of keys. An Iterator API allows an application to do a range scan on the database. The Iterator can seek to a specified key and then the application can start scanning one key at a time from that point. The Iterator API can also be used to do a reverse iteration of the keys in the database. A consistent-point-in-time view of the database is created when the Iterator is created. Thus, all keys returned via the Iterator are from a consistent view of the database.
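
A sketch of a forward range scan and a reverse scan with the Iterator API (the "user:" key prefix is illustrative):

    #include <cassert>
    #include <string>
    #include "rocksdb/db.h"

    // Forward range scan starting at "user:", then a full reverse scan.
    void ScanExample(rocksdb::DB* db) {
      rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());

      for (it->Seek("user:"); it->Valid(); it->Next()) {
        std::string key = it->key().ToString();
        std::string value = it->value().ToString();
        // ... process key/value; every read sees one consistent view ...
      }

      for (it->SeekToLast(); it->Valid(); it->Prev()) {
        // reverse iteration over the whole database
      }

      assert(it->status().ok());  // check for errors hit during iteration
      delete it;
    }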

A Snapshot API allows an application to create a point-in-time view of a database. The Get and Iterator APIs can be used to read data from a specified snapshot. In a sense, a Snapshot and an Iterator both provide a point-in-time view of the database, but their implementations are different. Short-lived/foreground scans are best done via an iterator while long-running/background scans are better done via a snapshot. An Iterator keeps a reference count on all underlying files that correspond to that point-in-time-view of the database - these files are not deleted until the Iterator is released. A Snapshot, on the other hand, does not prevent file deletions; instead the compaction process understands the existence of Snapshots and promises never to delete a key that is visible in any existing Snapshot.
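
A sketch of reading from a Snapshot while newer writes continue (key and value are illustrative):

    #include <string>
    #include "rocksdb/db.h"

    // Pin a point-in-time view, keep writing, and read from the old view.
    void SnapshotExample(rocksdb::DB* db) {
      const rocksdb::Snapshot* snap = db->GetSnapshot();

      db->Put(rocksdb::WriteOptions(), "k", "new-value");  // happens after the snapshot

      rocksdb::ReadOptions ro;
      ro.snapshot = snap;                  // Get/Iterator will ignore later writes
      std::string value;
      db->Get(ro, "k", &value);            // sees the value as of GetSnapshot()

      db->ReleaseSnapshot(snap);           // let compaction reclaim old versions
    }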

Snapshots are not persisted across database restarts: a reload of the RocksDB library (via a server restart) releases all pre-existing Snapshots.

Transactions

RocksDB supports multi-operational transactions. It supports both optimistic and pessimistic modes. See Transactions.
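
A sketch using the pessimistic TransactionDB; the optimistic variant (OptimisticTransactionDB) is used analogously. The path and keys are illustrative:

    #include <cassert>
    #include "rocksdb/utilities/transaction.h"
    #include "rocksdb/utilities/transaction_db.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::TransactionDBOptions txn_db_options;
      rocksdb::TransactionDB* txn_db;
      rocksdb::Status s = rocksdb::TransactionDB::Open(
          options, txn_db_options, "/tmp/rocksdb_txn_demo", &txn_db);
      assert(s.ok());

      // Pessimistic transaction: keys written here are locked until Commit/Rollback.
      rocksdb::Transaction* txn = txn_db->BeginTransaction(rocksdb::WriteOptions());
      txn->Put("balance:alice", "90");
      txn->Put("balance:bob", "110");
      s = txn->Commit();
      assert(s.ok());
      delete txn;

      delete txn_db;
      return 0;
    }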

Prefix Iterators

Most LSM-tree engines cannot support an efficient range scan API because it needs to look into multiple data files. But, most applications do not do pure-random scans of key ranges in the database; instead, applications typically scan within a key-prefix. RocksDB uses this to its advantage. Applications can configure Options.prefix_extractor to enable key-prefix based filtering. When Options.prefix_extractor is set, a hash of the prefix is also added to the Bloom filter. An Iterator that specifies a key-prefix (in ReadOptions) will use the Bloom filter to avoid looking into data files that do not contain keys with the specified key-prefix. See Prefix-Seek.
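
A sketch of configuring a fixed-length prefix extractor plus a Bloom filter, and of a prefix-bounded scan (the 8-byte prefix length and keys are illustrative):

    #include "rocksdb/db.h"
    #include "rocksdb/filter_policy.h"
    #include "rocksdb/slice_transform.h"
    #include "rocksdb/table.h"

    // Treat the first 8 bytes of every key as its prefix and build prefix Bloom filters.
    rocksdb::Options PrefixOptions() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));

      rocksdb::BlockBasedTableOptions table_options;
      table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
      return options;
    }

    void PrefixScan(rocksdb::DB* db) {
      rocksdb::ReadOptions ro;
      ro.prefix_same_as_start = true;          // stay within one prefix
      rocksdb::Iterator* it = db->NewIterator(ro);
      for (it->Seek("userkey1"); it->Valid(); it->Next()) {
        // only keys sharing the 8-byte prefix "userkey1" are visited
      }
      delete it;
    }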

Persistence

RocksDB has a Write Ahead Log (WAL). All write operations (Put, Delete and Merge) are stored in an in-memory buffer called the memtable as well as optionally inserted into WAL. On restart, it re-processes all the transactions that were recorded in the log.

WAL can be configured to be stored in a directory different from the directory where the SST files are stored. This is necessary for those cases in which you might want to store all data files in non-persistent fast storage. At the same time, you can ensure no data loss by putting all transaction logs on slower but persistent storage.

Each Put has a flag, set via WriteOptions, which specifies whether or not the Put should be inserted into the transaction log. The WriteOptions may also specify whether or not a fsync call is issued to the transaction log before a Put is declared to be committed.
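
A sketch combining the options above: data files on fast storage, the WAL on durable storage, and per-write durability chosen through WriteOptions (paths are illustrative):

    #include "rocksdb/db.h"

    // Keep data files on fast (possibly volatile) storage, WAL on durable storage,
    // and choose durability per write via WriteOptions.
    void PersistenceExample() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.wal_dir = "/durable-disk/rocksdb-wal";   // illustrative path

      rocksdb::DB* db;
      rocksdb::DB::Open(options, "/fast-storage/rocksdb-data", &db);

      rocksdb::WriteOptions unlogged;
      unlogged.disableWAL = true;                      // skip the transaction log
      db->Put(unlogged, "cache-entry", "v");

      rocksdb::WriteOptions durable;
      durable.sync = true;                             // fsync WAL before acknowledging
      db->Put(durable, "critical-entry", "v");

      delete db;
    }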

Internally, RocksDB uses a batch-commit mechanism to batch transactions into the log so that it can potentially commit multiple transactions using a single fsync call.

Data Checksumming

RocksDB uses a checksum to detect corruptions in storage. These checksums are for each SST file block (typically between 4K and 128K in size). A block, once written to storage, is never modified. RocksDB also maintains a Full File Checksum.
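
A hedged sketch of the related options; the per-block checksum type lives in BlockBasedTableOptions, and the full-file checksum factory shown here assumes a release that provides GetFileChecksumGenCrc32cFactory:

    #include "rocksdb/file_checksum.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    // Per-block checksum type plus a whole-file checksum generator.
    rocksdb::Options ChecksumOptions() {
      rocksdb::Options options;

      rocksdb::BlockBasedTableOptions table_options;
      table_options.checksum = rocksdb::kCRC32c;       // verified on every block read
      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

      // Full file checksums, recorded for each SST file.
      options.file_checksum_gen_factory = rocksdb::GetFileChecksumGenCrc32cFactory();
      return options;
    }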

RocksDB dynamically detects and utilizes CPU checksum offload support.

Multi-Threaded Compactions

In the presence of ongoing writes, compactions are needed for space efficiency, read (query) efficiency, and timely data deletion. Compaction removes key-value bindings that have been deleted or overwritten, and re-organizes data for query efficiency. Compactions may occur in multiple threads if configured.

The entire database is stored in a set of sstfiles. When a memtable is full, its content is written out to a file in Level-0 (L0) of the LSM tree. RocksDB removes duplicate and overwritten keys in the memtable when it is flushed to a file in L0. In compaction, some files are periodically read in and merged to form larger files, often going into the next LSM level (such as L1, up to Lmax).

The overall write throughput of an LSM database directly depends on the speed at which compactions can occur, especially when the data is stored in fast storage like SSD or RAM. RocksDB may be configured to issue concurrent compaction requests from multiple threads. It is observed that sustained write rates may increase by as much as a factor of 10 with multi-threaded compaction when the database is on SSDs, as compared to single-threaded compactions.
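
One way to allow concurrent compaction work (the thread counts are illustrative):

    #include "rocksdb/options.h"

    // Let several compactions (and flushes) run concurrently.
    rocksdb::Options ParallelCompactionOptions() {
      rocksdb::Options options;
      options.max_background_jobs = 8;   // total threads for flushes + compactions
      options.max_subcompactions = 4;    // split one large compaction across threads
      return options;
    }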

Compaction Styles

Both Level Style Compaction and Universal Style Compaction store data in a fixed number of logical levels in the database. More recent data is stored in Level-0 (L0) and older data in higher-numbered levels, up to Lmax. Files in L0 may have overlapping keys, but files in other levels generally form a single sorted run per level.

Level Style Compaction (default) typically optimizes disk footprint vs. logical database size (space amplification) by minimizing the files involved in each compaction step: merging one file in Ln with all its overlapping files in Ln+1 and replacing them with new files in Ln+1.

Universal Style Compaction typically optimizes total bytes written to disk vs. logical database size (write amplification) by merging potentially many files and levels at once, requiring more temporary space. Universal typically results in lower write-amplification but higher space- and read-amplification than Level Style Compaction.

FIFO Style Compaction drops the oldest file when obsolete and can be used for cache-like data. In FIFO compaction, all files are in level 0. When the total size of the data exceeds the configured size (CompactionOptionsFIFO::max_table_files_size), we delete the oldest table file.
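
A sketch of selecting a compaction style per column family; the FIFO size limit below is illustrative:

    #include "rocksdb/advanced_options.h"
    #include "rocksdb/options.h"

    // Level style is the default; the other two are opt-in per column family.
    rocksdb::Options UniversalOptions() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleUniversal;
      return options;
    }

    rocksdb::Options FifoOptions() {
      rocksdb::Options options;
      options.compaction_style = rocksdb::kCompactionStyleFIFO;
      // Drop the oldest files once total size exceeds ~1 GB (illustrative limit).
      options.compaction_options_fifo.max_table_files_size = 1ull << 30;
      return options;
    }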

We also enable developers to develop and experiment with custom compaction policies. For this reason, RocksDB has appropriate hooks to switch off the inbuilt compaction algorithm and has other APIs to allow applications to operate their own compaction algorithms. Options.disable_auto_compaction, if set, disables the native compaction algorithm. The GetLiveFilesMetaData API allows an external component to look at every data file in the database and decide which data files to merge and compact. Call CompactFiles to compact files you want. The DeleteFile API allows applications to delete data files that are deemed obsolete.
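
A hedged sketch of application-driven compaction using these APIs (the choice of input files is illustrative; the option is spelled disable_auto_compactions in current headers):

    #include <string>
    #include <vector>
    #include "rocksdb/db.h"

    // Turn off automatic compaction and drive it from the application instead.
    void ManualCompactionExample(rocksdb::DB* db) {
      // (db was opened with options.disable_auto_compactions = true)

      std::vector<rocksdb::LiveFileMetaData> files;
      db->GetLiveFilesMetaData(&files);              // inspect every live SST file

      std::vector<std::string> inputs;
      for (const auto& f : files) {
        if (f.level == 0) inputs.push_back(f.name);  // pick the L0 files, say
      }
      if (!inputs.empty()) {
        db->CompactFiles(rocksdb::CompactionOptions(), inputs, /*output_level=*/1);
      }
    }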

Metadata storage

A manifest log file is used to record all the database state changes. The compaction process adds new files and deletes existing files from the database, and it makes these operations persistent by recording them in the MANIFEST.

Avoiding Stalls

Background compaction threads are also used to flush memtable contents to a file on storage. If all background compaction threads are busy doing long-running compactions, then a sudden burst of writes can fill up the memtable(s) quickly, thus stalling new writes. This situation can be avoided by configuring RocksDB to keep a small set of threads explicitly reserved for the sole purpose of flushing memtable to storage.
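
A sketch of reserving high-priority threads for flushes so they are not queued behind compactions (thread counts are illustrative):

    #include "rocksdb/db.h"
    #include "rocksdb/env.h"

    // Dedicated high-priority flush threads keep a burst of writes from stalling
    // behind long-running compactions.
    rocksdb::Options FlushFriendlyOptions() {
      rocksdb::Options options;
      options.env = rocksdb::Env::Default();
      options.env->SetBackgroundThreads(2, rocksdb::Env::HIGH);  // flush threads
      options.env->SetBackgroundThreads(6, rocksdb::Env::LOW);   // compaction threads
      options.max_background_jobs = 8;
      return options;
    }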

Compaction Filter

Some applications may want to process keys at compaction time. For example, a database with inherent support for time-to-live (TTL) may remove expired keys. This can be done via an application-defined Compaction-Filter. If the application wants to continuously delete data older than a specific time, it can use the compaction filter to drop records that have expired. The RocksDB Compaction Filter gives control to the application to modify the value of a key or to drop a key entirely as part of the compaction process. For example, an application can continuously run a data sanitizer as part of the compaction.
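
A sketch of a custom compaction filter; the "expired:" value marker is purely illustrative of a TTL-like policy:

    #include <string>
    #include "rocksdb/compaction_filter.h"
    #include "rocksdb/options.h"
    #include "rocksdb/slice.h"

    // Drop every record whose value carries an (illustrative) expiry marker.
    class ExpiryFilter : public rocksdb::CompactionFilter {
     public:
      bool Filter(int /*level*/, const rocksdb::Slice& /*key*/,
                  const rocksdb::Slice& existing_value, std::string* /*new_value*/,
                  bool* /*value_changed*/) const override {
        return existing_value.starts_with("expired:");  // true => remove the key
      }
      const char* Name() const override { return "ExpiryFilter"; }
    };

    // Wiring it up (the filter object must outlive the DB that uses it):
    //   static ExpiryFilter filter;
    //   options.compaction_filter = &filter;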

ReadOnly Mode

A database may be opened in ReadOnly mode, in which the database guarantees that the application may not modify anything in the database. This results in much higher read performance because oft-traversed code paths avoid locks completely.
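
A sketch of opening a database read-only (the path reuses the earlier illustrative one):

    #include <cassert>
    #include <string>
    #include "rocksdb/db.h"

    // Open an existing database for reads only; all write APIs will fail.
    void ReadOnlyExample() {
      rocksdb::DB* db;
      rocksdb::Options options;
      rocksdb::Status s =
          rocksdb::DB::OpenForReadOnly(options, "/tmp/rocksdb_overview_demo", &db);
      assert(s.ok());

      std::string value;
      db->Get(rocksdb::ReadOptions(), "key1", &value);
      delete db;
    }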

Database Debug Logs

By default, RocksDB writes detailed logs to a file named LOG*. These are mostly used for debugging and analyzing a running system. Users can choose different log levels. This LOG may be configured to roll at a specified periodicity. The logging interface is pluggable. Users can plug in a different logger. See Logger.
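
A sketch of the related logging options (the verbosity, rolling period and retention below are illustrative):

    #include "rocksdb/options.h"

    // Tune the informational LOG file: verbosity, rolling period, retained files.
    rocksdb::Options LoggingOptions() {
      rocksdb::Options options;
      options.info_log_level = rocksdb::InfoLogLevel::WARN_LEVEL;  // less chatty
      options.log_file_time_to_roll = 24 * 60 * 60;  // roll the LOG daily (seconds)
      options.keep_log_file_num = 7;                 // keep one week of logs
      // A custom Logger implementation can be plugged in via options.info_log.
      return options;
    }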

Data Compression

RocksDB supports lz4, zstd, snappy, zlib, and lz4_hc compression, as well as xpress under Windows. RocksDB may be configured to support different compression algorithms for data at the bottommost level, where 90% of data lives. A typical installation might configure ZSTD (or Zlib if not available) for the bottom-most level and LZ4 (or Snappy if it is not available) for other levels. See Compression.
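
A sketch of the typical installation described above, expressed as options:

    #include "rocksdb/options.h"

    // Cheap compression for the hot levels, stronger compression at the bottom.
    rocksdb::Options CompressionOptions() {
      rocksdb::Options options;
      options.compression = rocksdb::kLZ4Compression;    // levels other than the last
      options.bottommost_compression = rocksdb::kZSTD;   // where ~90% of data lives
      return options;
    }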

Full Backups and Replication

RocksDB provides a backup API, BackupEngine. You can read more about it here: How to backup RocksDB

RocksDB itself is not a replicated database, but it provides some helper functions to enable users to implement their replication system on top of RocksDB, see Replication Helpers.

Support for Multiple Embedded Databases in the same process

A common use-case for RocksDB is that applications inherently partition their data set into logical partitions or shards. This technique benefits application load balancing and fast recovery from faults. This means that a single server process should be able to operate multiple RocksDB databases simultaneously. This is done via an environment object named Env. Among other things, a thread pool is associated with an Env. If applications want to share a common thread pool (for background compactions) among multiple database instances, then it should use the same Env object for opening those databases.
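
A sketch of two database instances ("shards") sharing one Env and therefore one background thread pool (paths and thread counts are illustrative):

    #include <cassert>
    #include "rocksdb/db.h"
    #include "rocksdb/env.h"

    int main() {
      rocksdb::Env* env = rocksdb::Env::Default();
      env->SetBackgroundThreads(4, rocksdb::Env::LOW);   // shared compaction threads

      rocksdb::Options options;
      options.create_if_missing = true;
      options.env = env;                                 // same Env for every shard

      rocksdb::DB* shard0 = nullptr;
      rocksdb::DB* shard1 = nullptr;
      rocksdb::Status s0 = rocksdb::DB::Open(options, "/tmp/shard-0", &shard0);
      rocksdb::Status s1 = rocksdb::DB::Open(options, "/tmp/shard-1", &shard1);
      assert(s0.ok() && s1.ok());

      delete shard0;
      delete shard1;
      return 0;
    }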

Similarly, multiple database instances may share the same block cache or rate limiter.

Block Cache -- Compressed and Uncompressed Data

RocksDB uses an LRU cache for blocks to serve reads. The block cache is partitioned into two individual caches: the first caches uncompressed blocks and the second caches compressed blocks in RAM. If a compressed block cache is configured, users may wish to enable direct I/O to prevent redundant caching of the same data in the OS page cache.
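
A sketch of configuring an uncompressed block cache; the same Cache object may be passed to several database instances to share memory between them (the 512 MB size is illustrative):

    #include <memory>
    #include "rocksdb/cache.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    // A 512 MB LRU cache for uncompressed blocks.
    rocksdb::Options BlockCacheOptions() {
      std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(512 << 20);

      rocksdb::BlockBasedTableOptions table_options;
      table_options.block_cache = cache;

      rocksdb::Options options;
      options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
      return options;
    }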

Table Cache

The Table Cache is a construct that caches open file descriptors. These file descriptors are for sstfiles. An application can specify the maximum size of the Table Cache, or configure RocksDB to always keep all files open, to achieve better performance.
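
A sketch of the relevant option (a value of -1 keeps every file open; a positive value caps the table cache):

    #include "rocksdb/options.h"

    rocksdb::Options TableCacheOptions() {
      rocksdb::Options options;
      options.max_open_files = -1;  // keep all sstfile descriptors open
      return options;
    }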

I/O Control

RocksDB allows users to configure I/O from and to SST files in different ways. Users can enable direct I/O so that RocksDB takes full control of the I/O and caching. An alternative is to leverage some options to allow users to hint at how I/O should be executed: they can suggest that RocksDB call fadvise on files being read, call periodic range sync on files being appended, enable direct I/O, and so on. See IO for more details.
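
A hedged sketch of a few of these I/O knobs (values are illustrative):

    #include "rocksdb/options.h"

    rocksdb::Options IoOptions() {
      rocksdb::Options options;
      options.use_direct_reads = true;                        // bypass the OS page cache
      options.use_direct_io_for_flush_and_compaction = true;  // direct I/O for writes
      options.bytes_per_sync = 1 << 20;                       // range-sync every ~1 MB
      options.advise_random_on_open = true;                   // fadvise hint for reads
      return options;
    }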

Stackable DB

RocksDB has a built-in wrapper mechanism to add functionality as a layer above the core database kernel. This functionality is encapsulated by the StackableDB API. For example, the time-to-live functionality is implemented by a StackableDB and is not part of the core RocksDB API. This approach keeps the code modularized and clean.
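
A sketch using DBWithTTL, the StackableDB that implements time-to-live (the path and TTL value are illustrative):

    #include <cassert>
    #include "rocksdb/utilities/db_ttl.h"

    // DBWithTTL is a StackableDB: a regular DB wrapped with time-to-live behaviour.
    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;

      rocksdb::DBWithTTL* db;
      rocksdb::Status s =
          rocksdb::DBWithTTL::Open(options, "/tmp/rocksdb_ttl_demo", &db, 3600 /*sec*/);
      assert(s.ok());

      // Eligible for removal during compaction after roughly one hour.
      db->Put(rocksdb::WriteOptions(), "session:1", "data");
      delete db;
      return 0;
    }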

Memtables:

  • Pluggable Memtables:
    The default implementation of the memtable for RocksDB is a skiplist. The skiplist is a sorted set, which is a necessary construct when the workload interleaves writes with range-scans. Some applications do not interleave writes and scans, however, and some applications do not do range-scans at all. For these applications, a sorted set may not provide optimal performance. For this reason, RocksDB's memtable is pluggable. Some alternative implementations are provided. Three memtables are part of the library: a skiplist memtable, a vector memtable and a prefix-hash memtable. A vector memtable is appropriate for bulk-loading data into the database. Every write inserts a new element at the end of the vector; when it is time to flush the memtable to storage, the elements in the vector are sorted and written out to a file in L0. A prefix-hash memtable allows efficient processing of gets, puts and scans-within-a-key-prefix. Although the pluggability of memtable is not provided as a public API, it is possible for an application to provide its own implementation of a memtable, in a private fork. (A configuration sketch follows this list.)

  • Memtable Pipelining
    RocksDB supports configuring an arbitrary number of memtables for a database. When a memtable is full, it becomes an immutable memtable and a background thread starts flushing its contents to storage. Meanwhile, new writes continue to accumulate to a newly allocated memtable. If the newly allocated memtable is filled up to its limit, it is also converted to an immutable memtable and is inserted into the flush pipeline. The background thread continues to flush all the pipelined immutable memtables to storage. This pipelining increases write throughput of RocksDB, especially when it is operating on slow storage devices.

  • Garbage Collection during Memtable Flush:
    When a memtable is being flushed to storage, an inline-compaction process is executed. Garbage is removed in the same way as in compactions. Duplicate updates for the same key are removed from the output stream. Similarly, if an earlier put is hidden by a later delete, then the put is not written to the output file at all. This feature greatly reduces the size of data on storage and write amplification, for some workloads.
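
A configuration sketch covering the first two items: a vector memtable for bulk loading plus a deeper flush pipeline (sizes and counts are illustrative):

    #include "rocksdb/memtablerep.h"
    #include "rocksdb/options.h"

    // Bulk-load-friendly memtable setup with a pipeline of several memtables.
    rocksdb::Options BulkLoadMemtableOptions() {
      rocksdb::Options options;
      options.memtable_factory.reset(new rocksdb::VectorRepFactory());
      options.allow_concurrent_memtable_write = false;  // required for non-skiplist memtables
      options.write_buffer_size = 64 << 20;             // 64 MB per memtable
      options.max_write_buffer_number = 4;              // active + pipelined immutable memtables
      return options;
    }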

Merge Operator

RocksDB natively supports three types of records, a Put record, a Delete record and a Merge record. When a compaction process encounters a Merge record, it invokes an application-specified method called the Merge Operator. The Merge can combine multiple Put and Merge records into a single one. This powerful feature allows applications that typically do read-modify-writes to avoid the reads altogether. It allows an application to record the intent-of-the-operation as a Merge record, and the RocksDB compaction process lazily applies that intent to the original value. This feature is described in detail in Merge Operator.
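
A sketch of a simple associative merge operator implementing a 64-bit counter (the encoding and key name are illustrative):

    #include <cstdint>
    #include <cstring>
    #include <string>
    #include "rocksdb/db.h"
    #include "rocksdb/merge_operator.h"

    // A little-endian uint64 counter: Merge records add to the stored value, so a
    // read-modify-write becomes a single Merge call.
    class UInt64AddOperator : public rocksdb::AssociativeMergeOperator {
     public:
      bool Merge(const rocksdb::Slice& /*key*/, const rocksdb::Slice* existing_value,
                 const rocksdb::Slice& value, std::string* new_value,
                 rocksdb::Logger* /*logger*/) const override {
        uint64_t base = 0, delta = 0;
        if (existing_value) std::memcpy(&base, existing_value->data(), sizeof(base));
        std::memcpy(&delta, value.data(), sizeof(delta));
        uint64_t sum = base + delta;
        new_value->assign(reinterpret_cast<const char*>(&sum), sizeof(sum));
        return true;
      }
      const char* Name() const override { return "UInt64AddOperator"; }
    };

    // Usage (sketch):
    //   options.merge_operator.reset(new UInt64AddOperator);
    //   uint64_t one = 1;
    //   db->Merge(rocksdb::WriteOptions(), "page-hits",
    //             rocksdb::Slice(reinterpret_cast<char*>(&one), sizeof(one)));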

DB ID

A globally unique ID is created at the time of database creation and stored in the IDENTITY file in the DB folder by default. Optionally it can be stored only in the MANIFEST file. Storing it in the MANIFEST file is recommended.
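
A hedged sketch of reading the ID back, assuming a release that provides DBOptions::write_dbid_to_manifest:

    #include <cassert>
    #include <string>
    #include "rocksdb/db.h"

    void DbIdExample() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.write_dbid_to_manifest = true;   // also record the ID in the MANIFEST

      rocksdb::DB* db;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_overview_demo", &db);
      assert(s.ok());

      std::string id;
      db->GetDbIdentity(id);                   // the database's globally unique ID
      delete db;
    }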

5. Tools

There are a number of interesting tools that are used to support a database in production. The sst_dump utility dumps all the keys-values in an sst file, as well as other information. The ldb tool can put, get, and scan the contents of a database. ldb can also dump the contents of the MANIFEST, and it can be used to change the number of configured levels of the database. See Administration and Data Access Tool for details.

6. Tests

There are a bunch of unit tests that test specific features of the database. A make check command runs all unit tests. The unit tests trigger specific features of RocksDB and are not designed to test data correctness at scale. The db_stress test is used to validate data correctness at scale. See Stress-test.

7. Performance

RocksDB performance is benchmarked via a utility called db_bench. db_bench is part of the RocksDB source code. Performance results of a few typical workloads using Flash storage are described here. You can also find RocksDB performance results for in-memory workload here.
