clickhouse 物化視圖的踩坑記錄

由于線上查詢大于1s,需要對于該次查詢進行優(yōu)化;為了加快查詢的效率,我們在基礎(chǔ)表上建立了一個物化視圖

CREATE MATERIALIZED VIEW dwst.tt (
`sort_key` UInt8,
 `id` UInt64,
 `type` UInt8,
 `is_profit_avg` UInt8,
 `bd1` UInt64,
 `bd2` UInt64,
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{ck_cluster}/dwst/tt', '{replica}') PARTITION BY sn_sort_key ORDER BY (id, type, bd1, bd2) SETTINGS index_granularity = 8192 AS 
SELECT halfMD5(id) % 64 AS sn_sort_key, id,  type, 
multiIf(((sum(v1) - sum(v2)) < 0, 2, 1) AS is_profit_avg, bd1, bd2 FROM dwst.base_detail  WHERE date <=(today() - 10) GROUP BY sort_key,id,type,bd1,
    bd2

為了安全性的原因,去掉了一些細節(jié);大概的意思就是在base_detail基礎(chǔ)表的基礎(chǔ)上聚合了id,type,bd1,bd2,截止在t-10 是否盈利的情況,因為基礎(chǔ)表的數(shù)據(jù)量比較大,想利用物化視圖,提前預計算數(shù)據(jù),減少查詢sql的是時間;
我們在實踐的過程中,發(fā)現(xiàn)了兩個問題,也對于clickhouse物化視圖有個更深的一些理解

問題一: 每次視圖中的數(shù)據(jù)總數(shù)不一致

  • 本地通過insert....remote的語句,模擬主表的插入,就是觸發(fā)物化視圖的計算功能;但是當基礎(chǔ)信息表數(shù)據(jù)更新之后,發(fā)現(xiàn)每次聚合的數(shù)據(jù)條目總數(shù)都是不一致

    通過查看clickhouse的官方issue,看看有沒有類似的問題,果然發(fā)現(xiàn)有兩條類似的問題 Duplicated data in MATERIALIZED VIEW Duplicated primary key in materialized view ,官方給出的解釋為下面

    Data blocks are deduplicated. For multiple writes of the same data block (data blocks of the same size containing the same rows in the same order), the block is only written once. 
The reason for this is in case of network failures when the client application doesn’t know if the data was written to the DB, so the INSERT query can simply be repeated.
 It doesn’t matter which replica INSERTs were sent to with identical data. 
 INSERTs are idempotent. Deduplication parameters are controlled by merge_tree server settings.
大概的意思 clickhouse insert 語句是冪等的,在對于同一個data block進行寫操作的時候,由于網(wǎng)絡原因,客戶端應用不確認數(shù)據(jù)已經(jīng)被寫入了,所以就出現(xiàn)了重復插入的問題

好吧,建議采用以下的解決方案:

*   使用子查詢對于重復的數(shù)據(jù)進行二次加工,進行去重(官方推薦)

*   使用ReplicatedReplacingMergeTree 執(zhí)行引擎進行數(shù)據(jù)的去重,這是我在實踐中想采用的,每次使用插入完數(shù)據(jù)之后,通過optimize table 的方式,對數(shù)據(jù)進行去重;

問題二: 每次盈利的行數(shù)不一致

對于數(shù)據(jù)進行去重之后,我發(fā)現(xiàn)數(shù)據(jù)的總數(shù)是準確的,但是每次的is_profit_avg的總數(shù)卻不是一致的,這使我有點惱火了;后面通過查找官方的文檔
A materialized view is implemented as follows: when inserting data to the table specified in `SELECT`, 
part of the inserted data is converted by this `SELECT` query, and the result is inserted in the view.

Important
Materialized views in ClickHouse are implemented more like insert triggers. If there’s some aggregation in the view query, it’s applied only to the batch of freshly inserted data.
 Any changes to existing data of source table (like update, delete, drop partition, etc.) doesn’t change the materialized view.</pre>

簡單翻譯一下就是:物化視圖本質(zhì)就像insert語句的觸發(fā)器;如果有什么集合的運算,他會應用于最新插入的數(shù)據(jù)當中;對于其他原表的變化,比如說,更新,刪除,刪除分區(qū),都不會影響到物化視圖的變化

A `SELECT` query can contain `DISTINCT`, `GROUP BY`, `ORDER BY`, `LIMIT`… Note that the corresponding conversions are performed independently on each block of inserted data.
   For example, if `GROUP BY` is set, data is aggregated during insertion, but only within a single packet of inserted data. 
The data won’t be further aggregated. The exception is when using an `ENGINE` that independently performs data aggregation, such as `SummingMergeTree`.

查詢語句可以包含distinct,group by ,order by ,limit,特別注意這些相關(guān)聯(lián)的約束只能應用于每個新插入的數(shù)據(jù)塊中;比如說,如果設置了group by ,這些語句只會應用于新插入的的數(shù)據(jù)當中,不會作用于已經(jīng)插入的分區(qū)當中;

總結(jié)

實踐中的例子,group by維度之后,得到的盈利值 是對于歷史數(shù)據(jù)整體求差值;必須對于歷史中的每條數(shù)據(jù)都要進行運算;這不太符合material view中的實踐場景,本質(zhì)上,物化視圖是對于流數(shù)據(jù)的處理,單條數(shù)據(jù)就是一個值,通過這個值進行累加,而不是對于離線數(shù)據(jù)的整體處理所得到的值;所以對于本次查詢的優(yōu)化,就放棄了使用物化視圖的方式;直接用中間表,每天計算一次;

參考鏈接

1: Duplicated data in MATERIALIZED VIEW

2: Duplicated primary key in materialized view

3: Data Replication

4: CREATE VIEW

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容