
Original article: click here
The source article was originally written in Russian and then translated into English, so the style reads a bit stiff overall. If you would rather not read it all, skip to the conclusions I added at the end.
Also, many thanks to @張健 (a young engineer obsessed with source code) for his great help. Corrections to the translation are welcome.
ClickHouse is a high-performance analytical database from the "fighting nation" (Russia), jokingly nicknamed the "Katyusha database" in the community.
Altinity is a company that provides ClickHouse consulting and services; its leadership includes ClickHouse developers as well as experts from Percona. A beta of Altinity's ClickHouse cloud service is already live.
Background
- Our guest author is ClickHouse evangelist Yegor Andreenko. Follow him on Twitter @f1yegor.
- This is a cross-post from: (https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324)
- Special thanks to Alexey Milovidov, ClickHouse developer, for providing material for this article.
Translation
- Our guest author is ClickHouse evangelist Yegor Andreenko; his Twitter: @f1yegor
- This article is cross-posted from: https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324
- Special thanks to ClickHouse developer Alexey Milovidov for providing the source material for this article.
Main text
- Recently I dived deep into ClickHouse. ClickHouse is a column-oriented database by Yandex with great performance for analytical queries. For example, check the benchmark and the post by Mark Litwintschik.
- This is a translation of an answer given by Alexey Milovidov (creator of ClickHouse) about composite primary keys.
Translation
- Recently I started digging deep into ClickHouse. ClickHouse is a high-performance columnar analytical database open-sourced by Yandex. For performance tests, see these two posts:
- This article is essentially a translation and summary of replies that ClickHouse developer Alexey Milovidov posted on Google Groups about ClickHouse primary keys (the original thread is in Russian).
Original question
- How many columns can a primary key have? And what is the layout of the data on the storage drive? Are there any theoretical/practical limits?
- Can columns with missing data in some rows be part of the primary key?
How many columns can a ClickHouse primary key consist of? How is the primary key stored on disk? Are there any constraints or limits, theoretical or practical?
If some rows are missing values for certain columns, can those columns still be part of the primary key?
ClickHouse primary keys
- Data in a table of MergeTree type is stored as a set of multiple parts. On average you can expect a small number of parts (single digits to tens per month).
- In every part, data is stored sorted lexicographically by primary key. For example, if your primary key is (CounterID, Date), rows are sorted by CounterID, and rows with the same CounterID are sorted by Date.
- The data structure of the primary key looks like an array of marks: the values of the primary key taken every index_granularity rows.
- index_granularity is a setting of the MergeTree engine; it defaults to 8192.
Translation
- In MergeTree-family engines, a table's data is made up of multiple parts; in general there will be a few parts per month (translator's note: ClickHouse's smallest partitioning unit is currently the month).
- Within each part, the data is sorted lexicographically by primary key. For example, with a primary key of (CounterID, Date), rows are first sorted by CounterID, and rows with the same CounterID are sorted by Date.
- The primary key's data structure looks like an array of marks: the primary-key values taken every index_granularity rows.
- In the MergeTree engine, the default index_granularity is 8192.
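The mark array described above can be sketched in Python. This is purely illustrative (not ClickHouse's actual on-disk format), and the tiny index_granularity is just for the demo:

```python
# Build a sparse "mark" array over rows already sorted by primary key
# (CounterID, Date): one primary-key value every index_granularity rows.
index_granularity = 4  # ClickHouse's default is 8192; tiny here for the demo

rows = [("a", 1), ("a", 2), ("a", 3), ("b", 1),
        ("b", 2), ("c", 1), ("c", 2), ("c", 3),
        ("d", 1)]

# One mark per granule boundary
marks = [rows[i] for i in range(0, len(rows), index_granularity)]
print(marks)  # [('a', 1), ('b', 2), ('d', 1)]
```

With 9 rows and a granularity of 4, the index keeps only 3 primary-key values, which is why it stays so small relative to the data.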
Example
- We say the primary key is a sparse index over sorted data. Let's visualise it with just one part. (The marks should be equally spaced, but ASCII art here is a bit imperfect):
- It's convenient to think of the marks as ticks on a ruler. The primary key allows ranges of data to be read effectively. For a SELECT, ClickHouse chooses the set of mark ranges that could contain the target data.

- The primary key is a sparse index over sorted data. Let's visualise it with a single part (ideally the marks in the figure would be evenly spaced, but ASCII art makes that hard).
- The marks work like the ticks on a ruler. The primary key makes range filtering very efficient: for a query, ClickHouse reads only the set of mark ranges that could contain the target data.
This way, if you select CounterID IN ('a', 'h'), the server reads data in the mark ranges [0, 3) and [6, 8).
For example, if the query condition is CounterID IN ('a', 'h'), the server reads the data covered by mark ranges [0, 3) and [6, 8).
If you select CounterID IN ('a', 'h') AND Date = 3, the server reads data in the mark ranges [1, 3) and [7, 8).
If the query condition is CounterID IN ('a', 'h') AND Date = 3, the server reads the data covered by mark ranges [1, 3) and [7, 8).
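The mark-range selection can be sketched as follows. This is a simplified model, not ClickHouse's actual algorithm; granule i is assumed to span keys from marks[i] to marks[i+1]:

```python
# Pick the granules (mark ranges) that may contain the target CounterIDs.
# Granule i holds rows whose CounterID lies between marks[i] and marks[i+1].
marks = ["a", "a", "b", "c", "e", "e", "g", "h", "i"]

def granules_for(targets, marks):
    hits = []
    for i in range(len(marks) - 1):
        lo, hi = marks[i], marks[i + 1]
        if any(lo <= t <= hi for t in targets):
            hits.append(i)
    return hits

# Only granules 0-1 and 6-7 can contain 'a' or 'h'; the rest are skipped.
print(granules_for({"a", "h"}, marks))  # [0, 1, 6, 7]
```

This mirrors the article's example: the server ends up reading the two disjoint mark ranges around 'a' and 'h' rather than the whole part.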
Sometimes the primary key helps even if only a condition on the second column is present in the SELECT:
Sometimes the primary key still works even when only the second column appears in the query condition:
If you select Date = 3, the server reads data in the mark range [1, 10).
If the only query condition is Date = 3, the server reads the data covered by mark range [1, 10).
In our example that is every mark except 0, i.e. 90% of the data. Here the index isn't very effective, but it still lets us skip part of the data.
In the example above, everything except mark 0, that is 90% of the data, must be scanned. The index's filtering effect is poor here, but it can still skip some data.
On the other hand, the more data we have for a single CounterID, the wider the ranges of Date the index lets us skip.
On the other hand, when a single CounterID has many rows, its run of rows spans many granules, so within that run the index can skip wider ranges of Date.
In any case, using the index can never be less efficient than a full scan.
All in all, using the index is never worse than a full table scan.
A sparse index reads some unnecessary rows: when reading one range of the primary key, up to index_granularity * 2 unnecessary rows are read in every part. This is normal, and you shouldn't try to reduce index_granularity. ClickHouse is designed to work effectively with large batches of rows, which is why reading a few extra rows doesn't hurt performance. index_granularity = 8192 is a good value for most cases.
A sparse index reads some unnecessary data: each read of a primary-key range pulls in up to index_granularity * 2 extra rows per part. This is normal for a sparse index, and there is no need to reduce index_granularity. ClickHouse is designed to process huge volumes of data efficiently, which is why a few extra reads don't hurt performance; index_granularity = 8192 is a good choice for most scenarios.
A sparse index allows working with tables that have an enormous number of rows, and it always fits in RAM.
A sparse index suits tables with massive row counts, and the sparse index file itself fits comfortably in memory.
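Back-of-the-envelope arithmetic shows why the index stays small even for a huge table (the bytes-per-mark figure is an assumption for illustration):

```python
# Why the sparse index fits in RAM: one mark per index_granularity rows.
rows = 10_000_000_000           # 10 billion rows
index_granularity = 8192
bytes_per_mark = 16             # assumed primary-key bytes per mark

marks = rows // index_granularity
index_bytes = marks * bytes_per_mark
print(marks)                         # about 1.2 million marks
print(index_bytes // (1024 * 1024))  # about 18 MiB
```

Even at 10 billion rows, the index is on the order of megabytes, so keeping it resident in memory is trivial.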
Primary key isn’t unique. You can insert many rows with the same value of primary key.
主鍵并不是唯一的,可以插入主鍵相同的數(shù)據(jù)行。
- Primary key can also contain functional expressions.
- Example: (CounterID, EventDate, intHash32(UserID))
- The primary key may likewise be built from functional expressions.
- For example: (CounterID, EventDate, intHash32(UserID))
Above it’s used to mix up the data of particular UserID for every tuple CounterID, EventDate. By-turn it’s used in sampling ( https://clickhouse.yandex/reference_en.html#SAMPLE clause).
上述例子中,通過使用Hash函數(shù),把特定的UserID對(duì)應(yīng)的CounterID和EventDate做了聚合,順便,這種聚合方式,可以在SAMPLE這個(gè)功能中利用到。
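The idea behind putting a hash of UserID into the key can be sketched like this. Here hash32 is a stand-in for intHash32 (any well-mixing 32-bit hash works), and the 10% threshold mimics a SAMPLE 0.1:

```python
import hashlib

# Rows with the same (CounterID, EventDate) are ordered by hash(UserID),
# so any prefix of the hash range gives an unbiased sample of users.
def hash32(user_id: int) -> int:
    # stand-in for ClickHouse's intHash32
    digest = hashlib.sha256(user_id.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:4], "little")

user_ids = list(range(1000))
threshold = int(2**32 * 0.1)   # keep the lowest ~10% of the hash range
sampled = [u for u in user_ids if hash32(u) < threshold]
print(len(sampled))  # roughly 100 of the 1000 users
```

Because hashed users are spread uniformly over the key range, reading a prefix of that range samples users rather than, say, only the earliest dates.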
Summary
- Let's sum up what the choice of primary key affects:
- The most important and obvious effect: the primary key allows less data to be read during SELECT queries. As the examples above show, it usually doesn't make sense to include many columns in the primary key for this purpose. Say you have primary key (a, b); adding one more column c to get (a, b, c) makes sense only if both of these conditions hold:
- you have queries that filter on this column;
- your data can contain quite long ranges (several times longer than index_granularity) of rows with the same values of (a, b).
In other words, it pays off when adding one more column allows the index to skip big enough ranges of data.
Data is sorted by primary key, which makes it more compressible. Sometimes adding one more column to the primary key lets the data compress better.
When you use MergeTree variants with additional merge logic (CollapsingMergeTree, SummingMergeTree, etc.), the primary key affects how data is merged. For this reason it might be necessary to use more columns in the primary key even when point 1 doesn't require them.
To sum up, here is what the choice of primary key affects:
Most important and obvious: the primary key reduces the amount of data read by SELECT queries. As the examples above show, stuffing many columns into the primary key for this purpose alone isn't very meaningful.
For example, say you have primary key (a, b) and add a column c, making it (a, b, c). This change only helps when both conditions hold:
- your queries filter on column c;
- your data contains long runs of rows (several times longer than index_granularity) with identical (a, b) values.
In other words, adding the column must let queries skip large enough ranges of data.
Data is sorted by primary key, which makes it easier to compress. Sometimes adding a column to the primary key gives better compression.
When you use other engines in the MergeTree family, such as CollapsingMergeTree or SummingMergeTree, the primary key affects how data is merged. For this reason, you may need extra columns in the primary key even when point 1 alone doesn't call for them.
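The first point's condition 2 (long runs of equal (a, b)) can be illustrated with a toy measurement; the row counts and granularity here are made up:

```python
import itertools

index_granularity = 4  # tiny, for the demo

# Case 1: every (a, b) value repeats only twice, so no run fills even one
# granule; an extra key column c cannot help skip granules.
short_runs = [(i // 2, 0) for i in range(16)]

# Case 2: one (a, b) value spans the whole part; ordering by c inside the
# run would let the index skip granules when filtering on c.
long_runs = [(0, 0)] * 16

def granules_spanned_by_longest_run(rows):
    longest = max(len(list(g)) for _, g in itertools.groupby(rows))
    return longest // index_granularity

print(granules_spanned_by_longest_run(short_runs))  # 0
print(granules_spanned_by_longest_run(long_runs))   # 4
```

Only in the second case does a run of equal (a, b) cover whole granules, so only there can a third key column actually prune reads.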
- The number of columns in the primary key isn't limited explicitly. A long primary key is usually useless. In real use cases the maximum I have seen was ~20 columns (for a SummingMergeTree), but I don't recommend this.
- A long primary key will negatively affect insert performance and memory usage.
- A long primary key will not negatively affect the performance of SELECT queries.
- There is no explicit limit on the number of columns in a primary key, but an overly long primary key is usually useless. The longest I have seen in practice was about 20 columns (in a SummingMergeTree scenario), and I don't recommend it.
- An overly long primary key drags down insert performance and uses too much memory.
- An overly long primary key does not hurt SELECT query performance.
- During insert, missing values of all columns will be replaced with default values and written to the table.
- During insert, columns without an explicit value are filled with their defaults and written to the table.
Translator's notes
On the leftmost-prefix rule
Anyone familiar with MySQL knows the important index concept of the leftmost-prefix rule: for an index on (a, b, c), a query that filters only on b or on c generally cannot use the index and falls back to a full table scan, because MySQL uses B-tree indexes.
In ClickHouse, the primary key index is not a B-tree but a sparse index.
As shown in the figure, a table partition contains the following data files:
- checksums
- the column list
- a .mrk file per column
- a .bin file per column
- the primary key
They relate to each other as follows:
- Every 8192 rows of data form one block.
- The primary key index takes one row of primary-key values every 8192 rows and records which block it belongs to.
- At query time, if the index can be used, it locates the candidate blocks, then finds each block's entry in the corresponding .mrk file.
- The .mrk file records the physical offset, within the column's whole .bin file, at which that block's data starts.
- The data is loaded into memory and then filtered in parallel.
[Figure: storage structure (存儲(chǔ)結(jié)構(gòu)-new.png)]
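The lookup chain described above can be sketched in Python; the key values and byte offsets below are invented for illustration:

```python
import bisect

index_granularity = 8192

# primary.idx: first primary-key value of each block (the sparse index)
primary_idx = ["a", "f", "m", "t"]
# .mrk: byte offset of each block's data inside the column's .bin file
mrk = [0, 65536, 131072, 196608]

def locate(key):
    # Binary-search the sparse index for the block that may hold `key`,
    # then return that block's offset in the .bin file.
    block = max(bisect.bisect_right(primary_idx, key) - 1, 0)
    return block, mrk[block]

print(locate("g"))  # (1, 65536): 'g' falls in the block starting at 'f'
```

The sparse index answers "which block", the .mrk file answers "where in the .bin file that block starts", and only then is column data actually read.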
Now let's look at a few examples to see whether ClickHouse depends on the leftmost-prefix rule:
[Figure: index usage analysis (索引使用分析-new.png)]
- When using only part of the primary key, if you skip a leading column, i.e. filter directly on y while skipping x, some data can still be filtered out, but the effect isn't great.
- So would you call that the leftmost-prefix rule or not?
Index usage recommendations
- In production, always specify the date column to bound the rows being filtered.
- Not every column needs to go into the primary key.
- (date, hour, min, ts) is a good choice in most cases, but make sure your SQL actually uses those conditions.
- In real production use, once date is applied, even a full scan over the remaining columns stays very fast for data sets under ten billion rows or so.
Why is ClickHouse so fast?
- The answer is full table scans.
- Which raises the question: why are its full scans still so fast?
- In computing there is a concept called SIMD, Single Instruction Multiple Data: one controller drives multiple processing units, applying the same operation to every element of a data vector at once, achieving spatial parallelism. This dramatically speeds up query processing, which is why even a full table scan can be very fast.
- On ClickHouse's efficiency in general:
- ClickHouse can't be viewed as just another database; it uses heavy parallel computation to squeeze single-machine performance to its limit.
- The Hadoop ecosystem depends heavily on cluster size: it scales out, moves computation to the data, divides and conquers, and gains efficiency by mapping first and then reducing.
- In practice you can clearly feel that a 10-node Hadoop cluster and a 100-node one are not comparable, simply because the data isn't spread widely enough on the small cluster.
- ClickHouse's approach can be understood like this: columnar storage already acts as a Map step at query time, and vectorized operations over each column add yet more parallelism, so single-machine efficiency is extremely high.
- Corrections are welcome if my understanding is off.
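The columnar-plus-vectorized idea above can be shown in shape, with plain Python standing in for what a real engine does with SIMD over packed column data:

```python
# A query like: SELECT sum(clicks) WHERE date = 3, over columnar data.
dates  = [1, 2, 3, 3, 3, 5]
clicks = [10, 20, 30, 40, 50, 60]

# Pass 1: one tight loop over a single column yields a selection bitmap.
# A vectorizing engine runs this comparison with SIMD, many values per op.
mask = [d == 3 for d in dates]

# Pass 2: apply the bitmap to only the columns the query actually needs.
total = sum(c for c, m in zip(clicks, mask) if m)
print(total)  # 120
```

The key point is that each pass touches one contiguous column, which is exactly the memory layout SIMD instructions and CPU caches like best; unneeded columns are never read at all.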


