[DuckDB] 多核算子并行的源碼解析

DuckDB 是近年來(lái)頗受關(guān)注的OLAP數(shù)據(jù)庫(kù),號(hào)稱是OLAP領(lǐng)域的SQLite,以精巧簡(jiǎn)單,性能優(yōu)異而著稱。筆者前段時(shí)間在調(diào)研Doris的Pipeline的算子并行方案,而DuckDB基于論文《Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age》實(shí)現(xiàn)SQL算子的高效并行化的Pipeline執(zhí)行引擎,所以筆者花了一些時(shí)間進(jìn)行了學(xué)習(xí)和總結(jié),這里結(jié)合了Mark Raasveldt進(jìn)行的分享和原始代碼來(lái)一一剖析DuckDB在執(zhí)行算子并行上的具體實(shí)現(xiàn)。

1. 基礎(chǔ)知識(shí)

問(wèn)題1:并行task的數(shù)目由什么決定 ?

image.png

Pipeline的核心是:Morsel-Driven,數(shù)據(jù)是拆分成了小部分的數(shù)據(jù)。所以并行Task的核心是:能夠利用多線程來(lái)處理數(shù)據(jù),每一個(gè)數(shù)據(jù)拆分為小部分,所以拆分并行的數(shù)目由Source決定。

DuckDB在GlobalSource上實(shí)現(xiàn)了一個(gè)虛函數(shù)MaxThread來(lái)決定task數(shù)目:

image.png

每一個(gè)算子的GlobalSource抽象了自己的并行度:

image.png

問(wèn)題2:并行task的怎么樣進(jìn)行多線程同步:

  • 多線程的競(jìng)爭(zhēng)只會(huì)發(fā)生在SinkOperator上,也就是Pipeline的尾端。
  • parallelism-aware的算法需要實(shí)現(xiàn)在Sink端
  • 其他的非Sink operators (比如:Hash Join Probe, Projection, Filter等), 不需要感知多線程同步的問(wèn)題
image.png

問(wèn)題3:DuckDB的是如何抽象接口的:

Sink的Opeartor 定義了兩種類型:GlobalState, LocalState

  1. GlobalState: 每個(gè)查詢的Operator全局只有一個(gè)GlobalSinkState,記錄全局部分的信息
class PhysicalOperator {
public:
    unique_ptr<GlobalSinkState> sink_state;
  1. LocalState: 每個(gè)查詢的PipelineExecutor都有一個(gè)LocalSinkState,都是局部私有
//! The Pipeline class represents an execution pipeline
class PipelineExecutor {
private:
    //! The local sink state (if any)
    unique_ptr<LocalSinkState> local_sink_state;

后續(xù)會(huì)詳細(xì)解析不同的sink之間的LocalState和GlobalState如何配合的,核心部分如下:

image.png

Sink :處理LocalState的數(shù)據(jù)

Combine:合并LocalState到GlobalState之中

2. 核心算子的并行

這部分進(jìn)行各個(gè)算子的源碼剖析,筆者在源碼的關(guān)鍵部分加上了中文注釋,以方便大家的理解

Sort算子

  • Sink接口:這里需要注意的是DuckDB排序是進(jìn)行了列轉(zhuǎn)行的工作的,后續(xù)讀取時(shí)需要行轉(zhuǎn)列。Sink這部分相當(dāng)于實(shí)現(xiàn)了部分?jǐn)?shù)據(jù)的排序工作。
SinkResultType PhysicalOrder::Sink(ExecutionContext &context, GlobalSinkState &gstate_p, LocalSinkState &lstate_p,
                                   DataChunk &input) const {
    auto &lstate = (OrderLocalSinkState &)lstate_p;
        
      // keys 是排序的列block,payload是輸出的排序后數(shù)據(jù),這里調(diào)用LocalState的SinkChunk,進(jìn)行數(shù)據(jù)的轉(zhuǎn)行,
    local_sort_state.SinkChunk(keys, payload);

    // 數(shù)據(jù)達(dá)到內(nèi)存閾值的時(shí)候進(jìn)行基數(shù)排序處理,排序之后的結(jié)果存入LocalState的本地的SortedBlock中
    if (local_sort_state.SizeInBytes() >= gstate.memory_per_thread) {
        local_sort_state.Sort(global_sort_state, true);
    }
    return SinkResultType::NEED_MORE_INPUT;
}
  • Combine接口: 加鎖,拷貝sorted block到Global State
void PhysicalOrder::Combine(ExecutionContext &context, GlobalSinkState &gstate_p, LocalSinkState &lstate_p) const {
    auto &gstate = (OrderGlobalSinkState &)gstate_p;
    auto &lstate = (OrderLocalSinkState &)lstate_p;
        // 排序剩余內(nèi)存中不滿的數(shù)據(jù)
    local_sort_state.Sort(*this, external || !local_sort_state.sorted_blocks.empty());

    // Append local state sorted data to this global state
    lock_guard<mutex> append_guard(lock);
    for (auto &sb : local_sort_state.sorted_blocks) {
        sorted_blocks.push_back(move(sb));
    }
}
  • MergeTask:?jiǎn)?dòng)核數(shù)相同的task來(lái)進(jìn)行Merge (這里可以看出DuckDB對(duì)于多線程的使用是很激進(jìn)的), 這里是通過(guò)Event的機(jī)制實(shí)現(xiàn)的
void Schedule() override {
        auto &context = pipeline->GetClientContext();
        idx_t num_threads = ts.NumberOfThreads();

        vector<unique_ptr<Task>> merge_tasks;
        for (idx_t tnum = 0; tnum < num_threads; tnum++) {
            merge_tasks.push_back(make_unique<PhysicalOrderMergeTask>(shared_from_this(), context, gstate));
        }
        SetTasks(move(merge_tasks));
    }

class PhysicalOrderMergeTask : public ExecutorTask {
public:
    TaskExecutionResult ExecuteTask(TaskExecutionMode mode) override {
        // Initialize merge sorted and iterate until done
        auto &global_sort_state = state.global_sort_state;
        MergeSorter merge_sorter(global_sort_state, BufferManager::GetBufferManager(context));
        
        // 加鎖,獲取兩路,不斷進(jìn)行兩路歸并,最終完成全局排序。
    while (true) {
        {
            lock_guard<mutex> pair_guard(state.lock);
            if (state.pair_idx == state.num_pairs) {
                break;
            }
            GetNextPartition();
        }
        MergePartition();
    }
        event->FinishTask();
        return TaskExecutionResult::TASK_FINISHED;
    }

聚合算子(這里分析的是Prefetch Agg Operator算子)

  • Sink接口:和Sort算子一樣,這里拆分為Group ChunkAggregate Input Chunk,可以理解為代表聚合時(shí)的key與value列。注意此時(shí)Sink接口上的聚合是在LocalSinkState上完成的。
SinkResultType PhysicalPerfectHashAggregate::Sink(ExecutionContext &context, GlobalSinkState &state,
                                                  LocalSinkState &lstate_p, DataChunk &input) const {
    lstate.ht->AddChunk(group_chunk, aggregate_input_chunk);
}


void PerfectAggregateHashTable::AddChunk(DataChunk &groups, DataChunk &payload) {
    auto address_data = FlatVector::GetData<uintptr_t>(addresses);
    memset(address_data, 0, groups.size() * sizeof(uintptr_t));
    D_ASSERT(groups.ColumnCount() == group_minima.size());

    // 計(jì)算group key列對(duì)應(yīng)的entry的位置
    idx_t current_shift = total_required_bits;
    for (idx_t i = 0; i < groups.ColumnCount(); i++) {
        current_shift -= required_bits[i];
        ComputeGroupLocation(groups.data[i], group_minima[i], address_data, current_shift, groups.size());
    }

    // 通過(guò)data加上面的entry位置 + tuple的偏移量,計(jì)算出對(duì)應(yīng)的內(nèi)存地址,并進(jìn)行init
    idx_t needs_init = 0;
    for (idx_t i = 0; i < groups.size(); i++) {
        D_ASSERT(address_data[i] < total_groups);
        const auto group = address_data[i];
        address_data[i] = uintptr_t(data) + address_data[i] * tuple_size;
    }
    RowOperations::InitializeStates(layout, addresses, sel, needs_init);

    // after finding the group location we update the aggregates
    idx_t payload_idx = 0;
    auto &aggregates = layout.GetAggregates();
    for (idx_t aggr_idx = 0; aggr_idx < aggregates.size(); aggr_idx++) {
        auto &aggregate = aggregates[aggr_idx];
        auto input_count = (idx_t)aggregate.child_count;
                // 進(jìn)行聚合的Update操作
        RowOperations::UpdateStates(aggregate, addresses, payload, payload_idx, payload.size());
    }
}
  • Combine接口: 加鎖,merge local hash tableglobal hash table
void PhysicalPerfectHashAggregate::Combine(ExecutionContext &context, GlobalSinkState &gstate_p,
                                           LocalSinkState &lstate_p) const {
    auto &lstate = (PerfectHashAggregateLocalState &)lstate_p;
    auto &gstate = (PerfectHashAggregateGlobalState &)gstate_p;

    lock_guard<mutex> l(gstate.lock);
    gstate.ht->Combine(*lstate.ht);
}
        // local state的地址vector
    Vector source_addresses(LogicalType::POINTER);
       // global state的地址vector
    Vector target_addresses(LogicalType::POINTER);
    auto source_addresses_ptr = FlatVector::GetData<data_ptr_t>(source_addresses);
    auto target_addresses_ptr = FlatVector::GetData<data_ptr_t>(target_addresses);

    // 遍歷所有hash table的表,然后進(jìn)行合并對(duì)應(yīng)能夠合并的key
    data_ptr_t source_ptr = other.data;
    data_ptr_t target_ptr = data;
    idx_t combine_count = 0;
    idx_t reinit_count = 0;
    const auto &reinit_sel = *FlatVector::IncrementalSelectionVector();
    for (idx_t i = 0; i < total_groups; i++) {
        auto has_entry_source = other.group_is_set[i];
        // we only have any work to do if the source has an entry for this group
        if (has_entry_source) {
            auto has_entry_target = group_is_set[i];
            if (has_entry_target) {
                // both source and target have an entry: need to combine
                source_addresses_ptr[combine_count] = source_ptr;
                target_addresses_ptr[combine_count] = target_ptr;
                combine_count++;
                if (combine_count == STANDARD_VECTOR_SIZE) {
                    RowOperations::CombineStates(layout, source_addresses, target_addresses, combine_count);
                    combine_count = 0;
                }
            } else {
                group_is_set[i] = true;
                // only source has an entry for this group: we can just memcpy it over
                memcpy(target_ptr, source_ptr, tuple_size);
                // we clear this entry in the other HT as we "consume" the entry here
                other.group_is_set[i] = false;
            }
        }
        source_ptr += tuple_size;
        target_ptr += tuple_size;
    }

        // 做對(duì)應(yīng)的merge操作
    RowOperations::CombineStates(layout, source_addresses, target_addresses, combine_count);

Join算子

  • Sink接口:和Sort算子一樣,注意此時(shí)Sink接口上的hash 表是在LocalSinkState上完成的。
SinkResultType PhysicalHashJoin::Sink(ExecutionContext &context, GlobalSinkState &gstate_p, LocalSinkState &lstate_p,
                                      DataChunk &input) const {
    auto &gstate = (HashJoinGlobalSinkState &)gstate_p;
    auto &lstate = (HashJoinLocalSinkState &)lstate_p;

    lstate.join_keys.Reset();
    lstate.build_executor.Execute(input, lstate.join_keys);
    // build the HT
    auto &ht = *lstate.hash_table;
    if (!right_projection_map.empty()) {
        // there is a projection map: fill the build chunk with the projected columns
        lstate.build_chunk.Reset();
        lstate.build_chunk.SetCardinality(input);
        for (idx_t i = 0; i < right_projection_map.size(); i++) {
            lstate.build_chunk.data[i].Reference(input.data[right_projection_map[i]]);
        }
                // 構(gòu)建local state的hash 表
        ht.Build(lstate.join_keys, lstate.build_chunk)

    return SinkResultType::NEED_MORE_INPUT;
}
  • Combine接口: 加鎖,拷貝local state的hash表到global state
void PhysicalHashJoin::Combine(ExecutionContext &context, GlobalSinkState &gstate_p, LocalSinkState &lstate_p) const {
    auto &gstate = (HashJoinGlobalSinkState &)gstate_p;
    auto &lstate = (HashJoinLocalSinkState &)lstate_p;
    if (lstate.hash_table) {
        lock_guard<mutex> local_ht_lock(gstate.lock);
        gstate.local_hash_tables.push_back(move(lstate.hash_table));
    }
}
  • MergeTask:?jiǎn)?dòng)核數(shù)相同的task來(lái)進(jìn)行Hash table的Merge (這里可以看出DuckDB對(duì)于多線程的使用是很激進(jìn)的), 每個(gè)任務(wù)merge一部分Block(DuckDB之中的行數(shù)據(jù),落盤使用)
void Schedule() override {
        auto &context = pipeline->GetClientContext();

        vector<unique_ptr<Task>> finalize_tasks;
        auto &ht = *sink.hash_table;
        const auto &block_collection = ht.GetBlockCollection();
        const auto &blocks = block_collection.blocks;
        const auto num_blocks = blocks.size();
        if (block_collection.count < PARALLEL_CONSTRUCT_THRESHOLD && !context.config.verify_parallelism) {
            // Single-threaded finalize
            finalize_tasks.push_back(
                make_unique<HashJoinFinalizeTask>(shared_from_this(), context, sink, 0, num_blocks, false));
        } else {
            // Parallel finalize
            idx_t num_threads = TaskScheduler::GetScheduler(context).NumberOfThreads();
            auto blocks_per_thread = MaxValue<idx_t>((num_blocks + num_threads - 1) / num_threads, 1);

            idx_t block_idx = 0;
            for (idx_t thread_idx = 0; thread_idx < num_threads; thread_idx++) {
                auto block_idx_start = block_idx;
                auto block_idx_end = MinValue<idx_t>(block_idx_start + blocks_per_thread, num_blocks);
                finalize_tasks.push_back(make_unique<HashJoinFinalizeTask>(shared_from_this(), context, sink,
                                                                           block_idx_start, block_idx_end, true));
                block_idx = block_idx_end;
                if (block_idx == num_blocks) {
                    break;
                }
            }
        }
        SetTasks(move(finalize_tasks));
    }

template <bool PARALLEL>
static inline void InsertHashesLoop(atomic<data_ptr_t> pointers[], const hash_t indices[], const idx_t count,
                                    const data_ptr_t key_locations[], const idx_t pointer_offset) {
    for (idx_t i = 0; i < count; i++) {
        auto index = indices[i];
        if (PARALLEL) {
            data_ptr_t head;
            do {
                head = pointers[index];
                Store<data_ptr_t>(head, key_locations[i] + pointer_offset);
            } while (!std::atomic_compare_exchange_weak(&pointers[index], &head, key_locations[i]));
        } else {
            // set prev in current key to the value (NOTE: this will be nullptr if there is none)
            Store<data_ptr_t>(pointers[index], key_locations[i] + pointer_offset);

            // set pointer to current tuple
            pointers[index] = key_locations[i];
        }
    }
}
  • 并行掃描hash表,進(jìn)行outer數(shù)據(jù)的處理:
void PhysicalHashJoin::GetData(ExecutionContext &context, DataChunk &chunk, GlobalSourceState &gstate_p,
                               LocalSourceState &lstate_p) const {
    auto &sink = (HashJoinGlobalSinkState &)*sink_state;
    auto &gstate = (HashJoinGlobalSourceState &)gstate_p;
    auto &lstate = (HashJoinLocalSourceState &)lstate_p;
    sink.scanned_data = true;

    if (!sink.external) {
        if (IsRightOuterJoin(join_type)) {
            {
                lock_guard<mutex> guard(gstate.lock);
                                // 拆解掃描部分hash表的數(shù)據(jù)
                lstate.ScanFullOuter(sink, gstate);
            }
                        // 掃描hash表讀取數(shù)據(jù)
            sink.hash_table->GatherFullOuter(chunk, lstate.addresses, lstate.full_outer_found_entries);
        }
        return;
    }
}


void HashJoinLocalSourceState::ScanFullOuter(HashJoinGlobalSinkState &sink, HashJoinGlobalSourceState &gstate) {
    auto &fo_ss = gstate.full_outer_scan;
    idx_t scan_index_before = fo_ss.scan_index;
    full_outer_found_entries = sink.hash_table->ScanFullOuter(fo_ss, addresses);
    idx_t scanned = fo_ss.scan_index - scan_index_before;
    full_outer_in_progress = scanned;
}

小結(jié)

  • DuckDB在多線程同步,核心就是在Combine的時(shí)候:加鎖,并發(fā)是通過(guò)原子變量的方式實(shí)現(xiàn)并發(fā)寫入hash表的操作
  • 通過(guò)local/global 拆分私有內(nèi)存和公共內(nèi)存,并發(fā)的基礎(chǔ)是在私有內(nèi)存上進(jìn)行運(yùn)算,同步的部分主要在公有內(nèi)存的更新

3. Spill To Disk的實(shí)現(xiàn)

DuckDB并沒有如筆者預(yù)期的實(shí)現(xiàn)異步IO, 所以任意的執(zhí)行線程是有可能Stall在系統(tǒng)的I/O調(diào)度上的,我想大概率是DuckDB本身的定位對(duì)于高并發(fā)場(chǎng)景的支持不是那么敏感所導(dǎo)致的。這里他們也作為了后續(xù)TODO的計(jì)劃之一。

image.png

4. 參考資料

DuckDB源碼

Push-Based Execution in DuckDB

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容