概覽
《Filter Representation in Vectorized Query Execution》這篇論文向我們展示了在向量化查詢執(zhí)行引擎里的兩種filter representation(bitmap以及select vectors)和兩種執(zhí)行策略:1. Full, 不管有多少active的row,都全部執(zhí)行;2. Partial,只執(zhí)行active的row,并用理論分析,輔之以實(shí)驗(yàn)論證,哪些情況下,哪種組合更優(yōu)。
后文用到的縮寫
- BM for BitMap
- SV for Select Vectors
Update Operation和Map Operation的區(qū)別

Map需要物化其結(jié)果到另一個(gè)新的vector中,update需要返回一個(gè)新的filter representation
原語(yǔ)操作代價(jià)公式

實(shí)現(xiàn)的所有策略
加了Manual后綴表示手動(dòng)的以SIMD的形式寫出核心操作進(jìn)行優(yōu)化,不以Manual結(jié)尾的則依賴編譯器自行進(jìn)行優(yōu)化

不同Operation的最優(yōu)策略
沒有SIMD可以加速的Operation
變長(zhǎng)數(shù)據(jù)的操作
最優(yōu)策略選SVPartial

We see that Full consistently performs worst when it processes more tuples. At full selectivity, Full and Selective process the same number of tuples, but the former slightly benefits from a simpler iteration logic. We also see that the performance of SVPartial is similar to that of BMPartial, despite the former’s simpler iteration logic, meaning that the core operation is the dominating cost.
選擇Partial策略,并且使用BM和SV的差別不大,因?yàn)榧词估碚撋蟂V的迭代邏輯更簡(jiǎn)單,但是主要的耗時(shí)在于算子的真正操作耗時(shí)。
整數(shù)相除/取模
最優(yōu)策略選SVPartial,結(jié)論與與”變長(zhǎng)數(shù)據(jù)的操作“一致
總結(jié)
We conclude that the number of tuples processed is the dominant factor for non-data-parallel primitives, making Full impractical. Because of the small impact of iteration logic time, SVPartial has a performance edge over BMPartial, but developers can choose either representation without much affecting performance.
簡(jiǎn)而言之,需要處理的元組數(shù)量是主要因素,最優(yōu)策略是SVPartial
有SIMD可以加速但不高效的Operation
即存在執(zhí)行分支的Operation,如邏輯或和邏輯與操作
最優(yōu)策略選SVPartial

We can see that Full is either always the worst strategy in Figure 5a, or only competitive at high selectivities (>= 0.85) in Figure 5b. Although Full is competitive at the highest selectivities, we recommend, for the sake of simplicity, the SVPartial strategy.
有SIMD可以加速且沒有分支的Operation
core operations with straight-line and data-parallel (SLDP) code (i.e., non-branching code that leverages SIMD instruc- tions).
Therefore, we expect Selective strategies to only be competitive when the selectivity is low, meaning that Full processes far more tuples. SVManual is the exception: it is a Selective strategy that uses SIMD instructions. Its gain in iteration logic time is, however, not as high as that of Full strategies because it uses a gather instruction [5] to collect the elements at the selected indices (see Figure 10). The higher the core operations time per tuple, the less important the gather overhead because iteration logic time becomes more insignificant. Thus, we expect SVManual to be competitive with Full at medium or below selectivities depending on the core operation.
Selective策略只會(huì)在選擇率較低的情況下更優(yōu),SVManual因?yàn)槭褂昧薙IMD指令,所以它比Full策略差的 選擇率閾值那個(gè)點(diǎn)要更高一些。但是因?yàn)樗褂昧薵ather指令,所以它在迭代效率上的性能提升并沒有那么大,所以在算子核心操作的時(shí)間代價(jià)越大,SVManual比Full策略差的 選擇率閾值就要更高
圖6和圖7中的One Operation和Five Operation的差異就在于Five Operation它模擬了更大核心操作時(shí)間的算子


In all experiments, SVPartial performs better than BMPartial. As explained in the previous section, this is only due to a simpler iteration logic. SVManual is always the best Selective strategy because it uses data-parallelism to reduce the time spent per tuple.
在Selective策略里,SVPartial要比BMPartial更優(yōu),因?yàn)榈矢?;SVManual是Selective策略里最優(yōu)的。
Manual Vectorization sometimes performs better than Auto-Vectorization with Full and BMFull. For example, the results in Figure 7a show that BMFullManual is 1.8× faster than BMFull. Upon investigation of the generated code, we discovered that the compiler is overly conservative when it comes to using AVX512. AVX512 registers result in decreased CPU frequency [10], so the compiler does not always use them. It often uses AVX2 registers instead. We, on the other hand, always use AVX512. As we will see in the next section, the decreased CPU frequency can slow down the query.
因?yàn)榫幾g器的原因,從實(shí)驗(yàn)結(jié)果看,手動(dòng)編寫的SIMD代碼(BMFullManual)要比BMFull更優(yōu),因?yàn)榫幾g器采用一種保守策略,考慮到使用AVX512會(huì)導(dǎo)致CPU主頻下降,經(jīng)常不對(duì)其做SIMD優(yōu)化。
Selective策略只在選擇比較低(<=0.2)的情況下更優(yōu),SVManual 可以將這個(gè)閾值提高(<=0.5左右),所以表明一個(gè)系統(tǒng)需要Mixed的策略,選擇率低時(shí)采用Selective,選擇率高時(shí)采用Full
實(shí)驗(yàn)
Although there are other types of primitives, they often have side-effects (e.g., insertion in a hash table for joins or an array for sorting) and are, therefore, not amenable to our optimizations for correctness reasons.
無(wú)副作用的算子會(huì)應(yīng)用前文的策略,有副作用的并不會(huì),他們只會(huì)選擇有效的行
Q1 – High Selectivity SLDP Primitives
Q1 consists of a set of SLDP operations on vectors with high selectivity (>= 0.95) followed by a side-effect full aggregation that dominates the running time.

BMFull在沒有副作用的算子運(yùn)算時(shí)間上表現(xiàn)最優(yōu),但是總的執(zhí)行時(shí)間,所有執(zhí)行策略大致相同。原因有兩個(gè):1. 主要是有副作用的算子(aggregation)執(zhí)行時(shí)間占主導(dǎo);2. 使用AVX512 SIMD的指令會(huì)導(dǎo)致CPU主頻下降,BMPartial 和BMFull 對(duì)于Aggregation算子的代碼是一樣的,但是BMPartial執(zhí)行只需要739ms,而BMFull執(zhí)行卻需要838ms
Q6 – Mixed Selectivity SLDP Primitives
Q6 contains five SLDP filters, an arithmetic SLDP projection, and a final aggregation. Across all input vectors, the second filter leads to a selectivity smaller than 0.15, triggering a threshold-based switch in the filter’s representation.

所有混合策略都比非混合策略更優(yōu);
盡管SVManual比BMPartial快,但是 BMFull+SVManual 比BMFull+BMPartial慢0.3%,這就是由于Filter Representation由BM向SV轉(zhuǎn)化的代價(jià)。
Finally, the SLDP primitives dominate the total running time for this query, so we do not observe the slowdown caused by AVX512 registers. Thus, for queries with a structure similar to Q6, the benefits of AVX512 outweigh its disadvantages.
因?yàn)镼6的運(yùn)行時(shí)代價(jià)主要由SLDP原語(yǔ)構(gòu)成,所以使用AVX512優(yōu)大于劣
Q4 – Low Selectivity, Inefficient Parallelism
Q4 is a join of two tables (LINEITEM, ORDERS) followed an aggregation and an order-by operator.
Hash Join的build側(cè),包含兩個(gè)SLDP的Filter,第一個(gè)filter選擇率為0.3,第二個(gè)降到0.1。Join的Probe側(cè),有一個(gè)選擇率為0.6的filter,并且包含很多復(fù)雜的探測(cè)原語(yǔ),如accessing hash table entries or the keys within those entries for exact comparison,這些原語(yǔ)使用SIMD的加速是不高效的,所以對(duì)于probe側(cè),應(yīng)該使用BMFull去執(zhí)行SLDP的filter,并使用SVPartial執(zhí)行復(fù)雜的原語(yǔ)而不是SVManual.

實(shí)驗(yàn)結(jié)果表明,BMFull+SVPartial 是最優(yōu)的。
總結(jié)
This work analyzed the impact of filter representation (i.e., Bitmap vs. Selection Vector) and compute strategy (i.e., Full vs. Selective) on the performance of the vectorized primitives in an in-memory analytical DBMS. We identified the factors that influence performance: number of tuples processed, iteration logic, and core operation time per tuple. We explained how each combination of representation and compute strategy balances between these three factors. Full has the cheapest iteration logic, processes all tuples, but spends less time on each tuple when SIMD vectorization is possible. Full is, however, only available with Bitmaps on Update primitives. Selective with SVs has a cheaper iteration logic than Selective with Bitmaps, and is more amenable to SIMD vectorization. We confirmed these observations with several micro-benchmarks. Finally, we showcased the benefits of our analysis on OLAP queries with multiple primitives and consistently achieved the best performance. Our performance gains over the best techniques that do not adapt filter representation and compute strategy can be up to 1.3×.
這篇文章在一個(gè)分析型內(nèi)存數(shù)據(jù)庫(kù)上,分析了filter的表示(BM還是SV)以及計(jì)算策略(Selective還是Full)對(duì)于向量化執(zhí)行的原語(yǔ)效率的影響。指出了影響效率的因素:需要處理的元組數(shù)量,迭代邏輯以及處理每個(gè)原子的核心操作所需要的時(shí)間。解釋了表示和計(jì)算策略的每種組合如何在這三個(gè)因素之間取得平衡。Full 的迭代邏輯最簡(jiǎn)單,代價(jià)最小,需要處理所有元組,但當(dāng) 有SIMD 加持時(shí),在每個(gè)元組上花費(fèi)的時(shí)間更少。但是,對(duì)于Update的原語(yǔ),F(xiàn)ull 僅適用于 Bitmaps表示。 Selective with SVs 比 Selective with Bitmaps 的迭代代價(jià)更小,并且更適合 SIMD 矢量化。并且文章中也用幾個(gè)微基準(zhǔn)證實(shí)了這些觀察結(jié)果。最后,文章展示了Mixed策略要比單種的表示和計(jì)算策略組合的最佳技術(shù)更優(yōu)。
對(duì)Apache IoTDB的MPP查詢執(zhí)行框架的思考
IoTDB中目前采取的是Full + BM的組合策略,當(dāng)然這里的BitMap,IoTDB使用的是一個(gè)boolean[],其實(shí)也是基于我們大部分的查詢場(chǎng)景選擇率都比較高,按照論文中的實(shí)驗(yàn)結(jié)果,選擇率較高時(shí),使用FullBM其實(shí)是最優(yōu)的。
當(dāng)然之后IoTDB也會(huì)采用Mixed策略,在一些異常點(diǎn)檢測(cè)的查詢場(chǎng)景下,選擇率一般都會(huì)比較低,在這種場(chǎng)景下,Partial + SV/BM的組合應(yīng)該是會(huì)更優(yōu)的,至于采用SV還是BM,這個(gè)還需要進(jìn)一步實(shí)驗(yàn)驗(yàn)證,雖然SV的迭代代價(jià)更小,但是首先受制于Java語(yǔ)言,即使采用了PartialSV,我們也無(wú)法像論文中一樣,進(jìn)一步用Manual的方式進(jìn)行SIMD優(yōu)化;其次,因?yàn)镮oTDB中其他查詢組件還都會(huì)保留BM的Filter Representation,所以SV到BM轉(zhuǎn)換的overhead也要考慮進(jìn)去。