北条麻妃在线一区二区,国产片日韩,超碰无码观看

向量檢索ANN(Approximate Nearest Neighbor Search)，指的是對于一個query向量，從向量庫中找到和它距離最接近的k個向量。這是一個典型的topk任務(wù)，這里要的不是“精確topk”而是“近似topk”，更多考慮的是精度和計算時間之間的權(quán)衡，犧牲一部分精度來降低計算時間。

Faiss的全稱是Facebook AI Similarity Search，是FaceBook的AI團隊針對大規(guī)模相似度檢索問題開發(fā)的一個工具，使用C++編寫，有python接口，對10億量級的索引可以做到毫秒級檢索的性能。

Faiss提供了各種不同的算法組合，比如犧牲一些精度，來提供檢索的速度；另外一種思路是，預先構(gòu)建索引，通過空間換時間。所以，在看faiss的各種算法實現(xiàn)的時候，需要關(guān)注的幾個指標，檢索速度，內(nèi)存占比，檢索精度。

一個高效的向量檢索模型網(wǎng)網(wǎng)需要滿足下面三個條件才能達到工業(yè)級可用：

實時查詢，支持海量（百億、千億級別）規(guī)模庫量級的；
存儲高效，要求構(gòu)建的向量索引模型數(shù)據(jù)壓縮比高，達到大幅縮減內(nèi)存使占用的目的；
召回精度好，topK有比較好的召回率，跟暴力搜索的結(jié)果相比；

常用的幾種向量距離：歐式距離，向量內(nèi)積，夾角余弦，漢明距離，杰卡德相似距離；

源碼編譯

git clone https://github.com/facebookresearch/faiss.git
git reset --hard dafdff110489db7587b169a0afee8470f220d295  # 我用的是這個版本

# 如果cmake版本太低，這里會提示需要升級cmake
cmake -B build -DFAISS_ENABLE_GPU=OFF -DFAISS_ENABLE_PYTHON=ON -DCMAKE_BUILD_TYPE=Debug -DBUILD_SHARED_LIBS=ON -DBUILD_TESTING=OFF .

# 升級cmake 
# 下載 
wget https://github.com/Kitware/CMake/releases/download/v3.23.0/cmake-3.23.0-linux-x86_64.sh

# 安裝，安裝好cd到對應(yīng)的目錄下查看是否有cmake的bin文件，配置環(huán)境變量
sudo bash ./cmake-3.23.0-linux-x86_64.sh --skip-licence --prefix=/usr

編譯&運行demo


#基于上一步編譯出來的faiss動態(tài)庫，編譯demo
#寫一個makefile，指定faiss動態(tài)庫的路徑和各種編譯依賴

CC = g++
CFLAGS = -g -Wall -I/home/your_path/faiss
LDFLAGS = -L/home/your_path/faiss/faiss/build/faiss -lfaiss
RPATH = -Wl,-rpath=/home/your_path/faiss/faiss/build/faiss

all: flat ivfflat hnsw ivfpq

flat: flat.cc
 $(CC) $(CFLAGS) -o flat flat.cc $(LDFLAGS) $(RPATH)

flat.cc

#include <faiss/IndexFlat.h>
#include <faiss/index_io.h>
#include <iostream>

int main() {
    std::cout << "hello faiss" << std::endl;
    // Define the dimension of the vectors
    int d = 4;

    // Define the number of vectors to index
    int nb = 100000;  // 10w

    // Generate some random data to index
    float *xb = new float[d * nb];   // 64 * 10w 
    for(int i = 0; i < nb; i++) {
        for(int j = 0; j < d; j++) {
            xb[d * i + j] = drand48(); // xb = [[向量1],[向量2] ... ]
        }
    }

    // Create an index
    faiss::IndexFlatL2 index(d);
    std::cout << "index dimension : " << d << std::endl;
    // Index the data
    index.add(nb, xb);  // xb = [dimension * vector num]

    // Save the index to disk
    // faiss::write_index(&index, "example.index");

    // Load the index from disk
    //faiss::IndexFlatL2 index2(d);
    //faiss::read_index(&index2, "example.index");

    // Search the index
    const int k = 5;
    const int query_vector_num = 3;
    long int *I = new long int[k * query_vector_num];  // 5 * 3 
    float *D = new float[k * query_vector_num];        // 5 * 3

    // 從xb中的前query_vector_num個向量作為輸入
    // 從xb中尋找最近的k個返回
    index.search(query_vector_num, xb, k, D, I);

    // Print the results
    for(int i = 0; i < query_vector_num; i++) {
        printf("Query %d:\\n", i);
        for(int j = 0; j < k; j++) {
            show_vector(xb, I[k * i + j], d);
            printf("  %ld (distance=%g)\\n", I[k * i + j], D[k * i + j]);
        }
    }

    // Clean up
    delete[] xb;
    delete[] I;
    delete[] D;
    return 0;
}

這樣一個簡單的使用flat算法的faiss demo就運行起來了。

源碼閱讀

代碼中的一些專有名詞：

flat	暴力檢索
code_size	一個向量長度，默認情況下向量是不壓縮的，長度=sizeof(float) * dim
codes	存放編碼后的向量叫codes，用一維的float數(shù)組表示，存放方式 [[向量1],[向量2] ... ]
metric_type	計算距離的不同方法
centroid	ivfflat算法中的聚類中心
probe	ivfflat算法中搜索聚類中心相鄰的區(qū)域

基類設(shè)計：
faiss的基類接口設(shè)計比較簡單
全量數(shù)據(jù) ：data→add→train(可選）→search
增量數(shù)據(jù) ：data→add→search

Index基類提供了基本的接口，后續(xù)各種繼承基類，實現(xiàn)具體的功能。

// 基類index類，提供了基本的添加向量，檢索topk，訓練（對于需要預訓練的檢索類型）
// 不同的派生類會繼承基類，來實現(xiàn)自定義實現(xiàn)
class Index {
    int d;        ///< vector dimension
    idx_t ntotal; ///< total nb of indexed vectors
    
    bool is_trained;
    
    /// type of metric this index uses for search
    MetricType metric_type;

    /** Add n vectors of dimension d to the index.
     *
     * Vectors are implicitly assigned labels ntotal .. ntotal + n - 1
     * This function slices the input vectors in chunks smaller than
     * blocksize_add and calls add_core.
     * @param n      number of vectors
     * @param x      input matrix, size n * d
     */
    virtual void add(idx_t n, const float* x) = 0;

     /** query n vectors of dimension d to the index.
     *
     * return at most k vectors. If there are not enough results for a
     * query, the result array is padded with -1s.
     *
     * @param n           number of vectors
     * @param x           input vectors to search, size n * d
     * @param k           number of extracted vectors
     * @param distances   output pairwise distances, size n*k
     * @param labels      output labels of the NNs, size n*k
     */
    virtual void search(
            idx_t n,
            const float* x,
            idx_t k,
            float* distances,
            idx_t* labels,
            const SearchParameters* params = nullptr) const = 0;

    /** Perform training on a representative set of vectors
     *
     * @param n      nb of training vectors
     * @param x      training vecors, size n * d
     */
    virtual void train(idx_t n, const float* x);
    
    /** encode a set of vectors
     *
     * @param n       number of vectors
     * @param x       input vectors, size n * d
     * @param bytes   output encoded vectors, size n * sa_code_size()
     */
    virtual void sa_encode(idx_t n, const float* x, uint8_t* bytes) const;

    /** decode a set of vectors
     *
     * @param n       number of vectors
     * @param bytes   input encoded vectors, size n * sa_code_size()
     * @param x       output vectors, size n * d
     */
    virtual void sa_decode(idx_t n, const uint8_t* bytes, float* x) const;
}

1. Flat暴力求解

Flat就是比較簡單的暴力算法求解topk，時間復雜度是O(M*N)。思路是對于一個查詢向量，遍歷整個向量庫依次計算和每一個向量的距離，將每一個距離push到一個堆中。

int main() {
    // 定義向量的維度
    int d = 4;
    // 定義向量的數(shù)量
    int nb = 100000;  // 10w

    // Generate some random data to index
    float *xb = new float[d * nb];   // 64 * 10w 
    for(int i = 0; i < nb; i++) {
        for(int j = 0; j < d; j++) {
            xb[d * i + j] = drand48(); // xb = [[向量1],[向量2] ... ]
        }
    }

    // 創(chuàng)建flat l2 index
    faiss::IndexFlatL2 index(d);
    // 在flat l2 index上添加向量庫
    index.add(nb, xb);  // xb = [dimension * vector num]

    // 檢索topk
    const int k = 5;
    const int query_vector_num = 3;
    // 保存檢索結(jié)果 I返回topk向量的index，D返回topk向量的距離
    long int *I = new long int[k * query_vector_num];  // 5 * 3 
    float *D = new float[k * query_vector_num];        // 5 * 3

    // 開始檢索
    // 從xb中的前query_vector_num個向量作為query向量，尋找topk，結(jié)果保存到D，I數(shù)組中
    index.search(query_vector_num, xb, k, D, I);

    // Print the results
    for(int i = 0; i < query_vector_num; i++) {
        printf("Query %d:\n", i);
        for(int j = 0; j < k; j++) {
            show_vector(xb, I[k * i + j], d);
            printf("  %ld (distance=%g)\n", I[k * i + j], D[k * i + j]);
        }
    }

    // Clean up
    delete[] xb;
    delete[] I;
    delete[] D;
    return 0;
}

暴力的計算topk代碼：

// x: query vector 
// y: data base
// d: dimension
// nx: the number of query vector number 
// ny: the number of data base total
// res: result 
template <class BlockResultHandler, bool use_sel = false>
void exhaustive_L2sqr_seq(
        const float* x,
        const float* y,
        size_t d,
        size_t nx,
        size_t ny,
        BlockResultHandler& res,
        const IDSelector* sel = nullptr) {
    using SingleResultHandler =
            typename BlockResultHandler::SingleResultHandler;
    int nt = std::min(int(nx), omp_get_max_threads());

    FAISS_ASSERT(use_sel == (sel != nullptr));

// #pragma omp parallel num_threads(nt)   //open mp 多線程優(yōu)化
    {
        SingleResultHandler resi(res);
// #pragma omp for
        // 從查詢的nx個向量里面依次遍歷 （對于每個query x，找到它的topk）
        for (int64_t i = 0; i < nx; i++) {
            const float* x_i = x + i * d;
            const float* y_j = y;
            resi.begin(i);
            // 和向量庫里面的向量依次比較
            // j 表示的向量的index， y_j 表示的是向量的起始地址
            for (size_t j = 0; j < ny; j++, y_j += d) {  
                if (use_sel && !sel->is_member(j)) {
                    continue;
                }
                float disij = fvec_L2sqr(x_i, y_j, d);  // 計算距離 這里沒有開平方根
                resi.add_result(disij, j);              // 加入堆中
            }
            resi.end();
        }
    }
}

2. IVFFlat

IVF表示的是inverted file ，相較于暴力匹配多了一步預篩的作用。目的是減少需要計算距離的目標向量的個數(shù)，做法就是直接對所有向量做k-means。

IVFFlat的基本思路是：假設(shè)聚類產(chǎn)出了1024個聚類中心，那么每來一個查詢向量，首先計算其與1024個聚類簇心的距離，然后選擇距離最近的top N個簇，只計算查詢向量與這幾個簇底下的向量的距離。這樣計算的復雜度降低為 O（M * N/cluster_num）

IVFFlat有什么問題，是否可以找到最優(yōu)解？
如果query的向量處于兩個聚類的邊緣，很有可能被分到另外一個聚類中心，導致無法計算出最有結(jié)果。但是我們可以通過各種方法提高它的精度。比如，找到與 xq 相近的幾個 centroids, 然后在這幾個區(qū)域內(nèi)搜索. 我們每次搜索的范圍(scope)覆蓋的區(qū)域數(shù)量。在 faiss 里面叫 nprobe, 我們可以提高 nprobe 的數(shù)量, 以提高精確度。

int main() {
    // ...
    int nlist = 100;    // 聚類中心
    int k = 4;          // top k 
    faiss::IndexFlatL2 quantizer(d);
    faiss::IndexIVFFlat index(&quantizer, d, nlist);
    assert(!index.is_trained);
    
    // step1 :訓練 
    // 目的是產(chǎn)生100個聚類中心
    index.train(nb, xb);
    assert(index.is_trained);
    
    // step2：add添加向量
    // 根據(jù)第一步計算出來的聚類中心，將向量添加到離他最近的聚類中心 (每個聚類中心都有對應(yīng)的倒排向量）
    index.add(nb, xb);
    
    // step3：查詢
    // 根據(jù)聚類中心和nprob（搜索的相鄰的聚類中心），找到對應(yīng)聚類中心的倒排鏈，然后在倒排鏈上完成檢索
    idx_t* I = new idx_t[k * nq];
    float* D = new float[k * nq];
    index.search(nq, xq, k, D, I);   // nq : the number of query , xy : query database 
}

IndexIVF 類設(shè)計：

// IndexIVF 繼承了 Index, 同時繼承了 L1 量化器
struct IndexIVF : Index, Level1Quantizer {

    InvertedLists* invlists;   // 向量編碼后存儲在倒排表里面(包括很多個倒排文件)  // 本質(zhì)是兩個vector嵌套的二維數(shù)組
    bool own_invlists;
 
    size_t code_size; // 一個d維向量最后被編碼為多少字節(jié)
 
    size_t nprobe;    // 搜索的時候, 在幾個區(qū)域里搜
    size_t max_codes; // 一次搜索最多在幾個區(qū)域里搜, 用于覆蓋 nprobe 參數(shù)
        
    // OpenMP 并行化模式
    // 0: 默認情況, 將請求拆分
    // 1: 在倒排表上進行并行化
    // 2: 同時使用0和1
    // 3: 將請求拆分為更細粒度
    int parallel_mode;
    const int PARALLEL_MODE_NO_HEAP_INIT = 1024;
        
    // 記錄向量id與倒排表中倒排文件的關(guān)系, 可選
    // 用于支持 reconstruct 方法, 暫時不用關(guān)注
    DirectMap direct_map;
 
    IndexIVF(
            Index* quantizer,
            size_t d,
            size_t nlist,
            size_t code_size,
            MetricType metric = METRIC_L2);
 
    void reset() override;
 
    void train(idx_t n, const float* x) override;
 
    void add(idx_t n, const float* x) override;
 
    void add_with_ids(idx_t n, const float* x, const idx_t* xids) override;
 
    void search(
            idx_t n,
            const float* x,
            idx_t k,
            float* distances,
            idx_t* labels) const override;
 
    size_t remove_ids(const IDSelector& sel) override;
 
    ~IndexIVF() override;
 
   ...
};

step1 train:

// 訓練 L1 量化器以及子量化器
// n 為訓練集向量數(shù)量(向量維度d)
// x 為訓練集(長度 n*d)
void IndexIVF::train(idx_t n, const float* x) {
    if (verbose)    // 打印詳情, debug 用, 默認 false
        printf("Training level-1 quantizer\n");
        
    // 該方法繼承自 Level1Quantizer, 訓練 L1 量化器
    train_q1(n, x, verbose, metric_type);
 
    if (verbose)
        printf("Training IVF residual\n");
        
    // 通過殘差訓練子量化器, 默認的實現(xiàn)是啥也不干, 先不管它
    train_residual(n, x);
    
    // 標記我們的 IVF 索引已經(jīng)訓練好了, 可以用了
    is_trained = true;
}

step2 add:

void IndexIVF::add_with_ids(idx_t n, const float* x, const idx_t* xids) {
    std::unique_ptr<idx_t[]> coarse_idx(new idx_t[n]);
    // 1. 找到添加的向量對應(yīng)的聚類中心
    quantizer->assign(n, x, coarse_idx.get());
    // 2. 開始添加
    add_core(n, x, xids, coarse_idx.get());
}

// n : 添加的向量的數(shù)量
// x : 向量數(shù)據(jù)
// xids：nullptr 
// coarse_idx: 對應(yīng)的聚類中心數(shù)組
void IndexIVF::add_core(
        idx_t n,
        const float* x,
        const idx_t* xids,
        const idx_t* coarse_idx,
        void* inverted_list_context) {
 
    FAISS_THROW_IF_NOT(coarse_idx);
    FAISS_THROW_IF_NOT(is_trained);
    direct_map.check_can_add(xids);

    size_t nadd = 0, nminus1 = 0;

    for (size_t i = 0; i < n; i++) {
        if (coarse_idx[i] < 0)
            nminus1++;
    }

    std::unique_ptr<uint8_t[]> flat_codes(new uint8_t[n * code_size]);
    encode_vectors(n, x, coarse_idx, flat_codes.get());

    DirectMapAdd dm_adder(direct_map, n, xids);

#pragma omp parallel reduction(+ : nadd)
    {
        int nt = omp_get_num_threads();
        int rank = omp_get_thread_num();

        // each thread takes care of a subset of lists
        for (size_t i = 0; i < n; i++) {
            idx_t list_no = coarse_idx[i];
            if (list_no >= 0 && list_no % nt == rank) {
                idx_t id = xids ? xids[i] : ntotal + i;
                size_t ofs = invlists->add_entry( 
                        list_no,
                        id,
                        flat_codes.get() + i * code_size,
                        inverted_list_context);

                dm_adder.add(i, list_no, ofs);

                nadd++;
            } else if (rank == 0 && list_no == -1) {
                dm_adder.add(i, -1, 0);
            }
        }
    }
    ntotal += n;
}

step3 search:

void IndexIVF::search(
        idx_t n,
        const float* x,
        idx_t k,
        float* distances,
        idx_t* labels,
        const SearchParameters* params_in) const {
    FAISS_THROW_IF_NOT(k > 0);
    const size_t nprobe =
            std::min(nlist, params ? params->nprobe : this->nprobe);   // nprob 表示查找?guī)讉€區(qū)域 默認是=1
    FAISS_THROW_IF_NOT(nprobe > 0);

    // search function for a subset of queries
    auto sub_search_func = [this, k, nprobe, params](
                                   idx_t n,
                                   const float* x,
                                   float* distances,
                                   idx_t* labels,
                                   IndexIVFStats* ivf_stats) {
        std::unique_ptr<idx_t[]> idx(new idx_t[n * nprobe]);         // 通過query的index，找到它的聚類中心的index
        std::unique_ptr<float[]> coarse_dis(new float[n * nprobe]);  // 通過query的index，找到它到聚類中心的距離
        double t0 = getmillisecs();   
        // IndexFlat::search  這里應(yīng)該是找到query向量的聚類中心向量
        quantizer->search(
                n,
                x,
                nprobe,
                coarse_dis.get(),
                idx.get(),
                params ? params->quantizer_params : nullptr);

        double t1 = getmillisecs();
        // 預取所需的倒排文件, 默認實現(xiàn)是啥也不干(倒排文件放磁盤的時候才預取)
        invlists->prefetch_lists(idx.get(), n * nprobe);
        // 真正的搜索
        search_preassigned(
                n,
                x,
                k,
                idx.get(),
                coarse_dis.get(),
                distances,
                labels,
                false,
                params,
                ivf_stats);
        double t2 = getmillisecs();
        ivf_stats->quantization_time += t1 - t0;
        ivf_stats->search_time += t2 - t0;
    };

    if ((parallel_mode & ~PARALLEL_MODE_NO_HEAP_INIT) == 0) {
        // ... 先忽略
    } else {
        // 入口
        sub_search_func(n, x, distances, labels, &indexIVF_stats);
    }
}

堆設(shè)計

// CMin 泛型類
// 提供T對象的cmp操作，min類就是比較誰最小;
// T1類型支持傳入多個參數(shù)，如果T相同，那么支持比較T1
template <typename T_, typename TI_>
struct CMin {
    typedef T_ T;
    typedef TI_ TI;
    typedef CMax<T_, TI_> Crev; // reference to reverse comparison
    
    // 提供cmp比較誰更小
    inline static bool cmp(T a, T b) {
        return a < b;
    }
    inline static bool cmp2(T a1, T b1, TI a2, TI b2) {
        return (a1 < b1) || ((a1 == b1) && (a2 < b2));
    }
    
    // 返回T類型的最小值
    inline static T neutral() {
        return std::numeric_limits<T>::lowest();
    }
    static const bool is_max = false;

    // 返回T類型的下一個值
    inline static T nextafter(T x) {
        return cmin_nextafter(x);
    }
};

// 使用方式2：
using HeapForIP2 = CMin<uint16_t, int64_t>;
HeapForIP2 heap2;
std::cout << heap2.cmp(100, 200) << std::endl;
std::cout << heap2.neutral() << std::endl;
std::cout << heap2.nextafter(100) << std::endl;

// 使用方式1：
// IP內(nèi)積實際上是算余弦相似度, 越大越相似, 使用最小堆求 topK 個最大值
// L2計算里氏距離, 越小越相似, 使用最大堆求 topK 個最小值
using HeapForIP = CMin<float, idx_t>;
using HeapForL2 = CMax<float, idx_t>;
heap_heapify<HeapForIP>(k, simi, idxi);

template <class C>
inline void heap_heapify(
        size_t k,
        typename C::T* bh_val,
        typename C::TI* bh_ids,
        const typename C::T* x = nullptr,
        const typename C::TI* ids = nullptr,
        size_t k0 = 0) {
    if (k0 > 0)
        assert(x);

    if (ids) {
        for (size_t i = 0; i < k0; i++)
            heap_push<C>(i + 1, bh_val, bh_ids, x[i], ids[i]);
    } else {
        for (size_t i = 0; i < k0; i++)
            heap_push<C>(i + 1, bh_val, bh_ids, x[i], i);
    }

    for (size_t i = k0; i < k; i++) {
        bh_val[i] = C::neutral();
        bh_ids[i] = -1;
    }
}

template <class C>
inline void heap_push(
        size_t k,
        typename C::T* bh_val,
        typename C::TI* bh_ids,
        typename C::T val,
        typename C::TI id) {
    bh_val--; /* Use 1-based indexing for easier node->child translation */
    bh_ids--;
    size_t i = k, i_father;
    while (i > 1) {
        i_father = i >> 1;
        if (!C::cmp2(val, bh_val[i_father], id, bh_ids[i_father])) {
            /* the heap structure is ok */
            break;
        }
        bh_val[i] = bh_val[i_father];
        bh_ids[i] = bh_ids[i_father];
        i = i_father;
    }
    bh_val[i] = val;
    bh_ids[i] = id;
}

IVFFlat的主要參數(shù)：

向量維度	d ：128
metrice_type	ip內(nèi)積
nlist	nlist=3000 聚類區(qū)域個數(shù)
niter	niter = 25 聚類迭代次數(shù)
nredo	nredo = 1 聚類重復計算次數(shù)，取最小結(jié)果
nprobe	nprobe=3 搜索查找相鄰聚類中心的數(shù)量

3. IVFPQ

IVFFlat 可以在犧牲一部分精度的前提下，提高檢索的速度。但是flat的方式，并不會向量進行壓縮，為了節(jié)約內(nèi)存提出了ivfpq的算法。PQ 是 Product quantization（乘積量化）的縮寫，按照文章里面的測試，可以降低90%多的內(nèi)存占用。

IVFPQ 的思路是：

首先對向量庫的向量按照維度進行切分，比如把 n * 128維的向量切分為 n * （8 個 16維）的子向量；
分別對子向量進行聚類（ivf提現(xiàn)在這里），比如每個子向量產(chǎn)出 256個聚類中心，一共 8 * 256 個聚類中心；
對于一個128維的向量而來，可以分為8個子向量，每個子向量都可以找到對應(yīng)的聚類中心。那么，就可以用對應(yīng)的聚類中心序號來代替這個向量。[1,2,xxxxxx, 127] --> [c1,c2,c3,c4,c5,c6,c7,c8]。這里c的值域空間是0~255，只需要8個bit就可以存下來；這樣就可以把一個很長的向量用很小的空間保存下來了。（pq核心思想）
對于查詢向量q來說，首先先把自己量化，變?yōu)閇c1,c3,c4,c2,xxxx,cn]。然后，在量化過后的向量庫上做topk，找到和自己最近的k個量化后的向量。然后，將這個k個向量轉(zhuǎn)換為對應(yīng)的原始向量。計算出距離。

int main() {
    int nlist = 100;                 // 聚類中心數(shù)量
    int k = 4;                       // topk
    int m = 8;                       // number of subquantizers
    faiss::IndexFlatL2 quantizer(d); 
    faiss::IndexIVFPQ index(&quantizer, d, nlist, m, 8);
    index.train(nb, xb);
    index.add(nb, xb);

    idx_t* I = new idx_t[k * 5];
    float* D = new float[k * 5];
    index.search(5, xb, k, D, I);
            
    return 0;
}

4. HNSW

HNSW是多層 negivate small world的縮寫，HNSW是最早是Yury Malkov在這篇論文中提出來的 HNSW，感興趣的可以看原文。HNSW的核心思想是利用了圖論的思想（首先假設(shè)有一張圖，先不考慮這張圖是如何構(gòu)建的），通過隨機找一個入口點，從入口點開始按照各種算法去遍歷圖，直到直到相近的點。簡單的來說，HNSW像是一張構(gòu)建在圖上的多層跳表，檢索的時候利用跳表快速尋找到最近的點。

HNSW可以分成構(gòu)建（build）和檢索（search）兩個階段。構(gòu)建階段的主要目的是，根據(jù)向量去構(gòu)建去一張多層的圖；檢索階段，主要是根據(jù)query 向量在上一階段構(gòu)建出來的圖進行查詢，從圖的最高層向下檢索直到找到最近的topk個向量。

構(gòu)建（build）階段
1. 利用指數(shù)分布的函數(shù)，隨機產(chǎn)出每個向量點映射到的最高層；這一步選擇出了哪些節(jié)點在哪些層；
2. 從最高層開始向下，根據(jù)上一步得到的每一層的節(jié)點，去構(gòu)建每一層的鄰接關(guān)系；
3. 每一層的構(gòu)建邏輯是，每個節(jié)點依次插入
  3.1 插入的時候，在目前正在構(gòu)建的圖中開始貪心檢索；
  3.2 從入口的entry point開始，找到entry point的所有鄰居中距離插入向量最近的點；
  3.3 把這個最近的點設(shè)置為新的entry point，重復上一步；
  3.4 直到計算出來的距離 = prev_計算出來的距離（說明達到局部最優(yōu)），然后跳出循環(huán)；
  3.5 這里找到了一個鄰居，其他的m-1個通過bfs找到；
4. 剪枝（shrink）
  當鄰居的數(shù)量超過了m，需要進行剪枝。這里剪枝的主要出發(fā)點是，盡可能的去保留“高速高路”，即不要一堆相鄰的點聚集在一起，而是保留一些離的很遠的點，這么做的目的是，讓跳表的的范圍比較廣。這個就是hnsw論文里面精華的核心思想了。

build階段的產(chǎn)出是兩個，一個是存儲向量的文件（codes），一個是存儲圖的關(guān)系的文件（index）。下一步檢索就是基于兩個文件尋找query向量的topk鄰居。

檢索（search）階段
1. 從圖的最高層entry point開始向下檢索，一層一層的向下貪心查找，找到每一層最近的鄰居，直到找到0層；
2. 在第0層開始，從上一步找到的entry point開始，進行bfs（廣度優(yōu)先遍歷），遍歷的同時維護一個最大堆，保留這個過程找到的topk
3. 直到bfs的次數(shù)達到的ef-search的上線，停止遍歷。堆里面的元素就是找到的topk。

代碼：
build階段

#include <iostream>
#include <faiss/IndexHNSW.h>
#include <faiss/IndexFlat.h>
#include <faiss/index_io.h>
#include <iostream>

int main() {
    int d  = 64;     // 向量維度
    int nb = 10000;  // 向量數(shù)量

    // 創(chuàng)建一個隨機的數(shù)據(jù)集
    float* xb = new float[nb * d];
    for (int i = 0; i < nb * d; i++) {
        xb[i] = drand48();
    }

    // 構(gòu)建索引
    faiss::IndexHNSWFlat index(d, 32);
    index.add(nb, xb);

    // Save the index to disk
    faiss::write_index(&index, "example.index");

    delete[] xb;
    return 0;
}

search階段

這是另外一段代碼：
#include <iostream>
#include <faiss/IndexHNSW.h>
#include <faiss/IndexFlat.h>
#include <faiss/index_io.h>
#include <iostream>

int main() {
    int d = 64;         // 向量維度
    int k = 5;          // 查詢的最近鄰數(shù)量

    // 進行查詢
    faiss::Index* index = faiss::read_index("example.index");

    float* xq = new float[d];
    for (int i = 0; i < d; i++) {
        xq[i] = drand48();
    }

    long int* I = new long int[k];
    float* D = new float[k];
    index->search(1, xq, k, D, I);

    // 打印查詢結(jié)果
    std::cout << "Nearest neighbors:" << std::endl;
    for (int i = 0; i < k; i++) {
        std::cout << "Index: " << I[i] << ", Distance: " << D[i] << std::endl;
    }
    delete[] xq;
    delete[] I;
    delete[] D;
    return 0;
}

4. faiss中關(guān)于計算效率的優(yōu)化

faiss 大面積的使用了open mp來進行平行計算來提供計算效率，這種場景讀多寫少，for（i~n）循環(huán)計算的邏輯非常適合用open mp改造。舉一個簡單的例子：

#include <iostream>
#include <omp.h>
#include <vector>
#include <chrono>

int main() {
    int n = 100000000;
    std::vector<int> nums(n, 1);
    int sum = 0;

    auto start = std::chrono::high_resolution_clock::now();

    // 無優(yōu)化的串行計算
    for (int i = 0; i < n; i++) {
        sum += nums[i];
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> serial_time = end - start;
    std::cout << "Serial sum: " << sum << " - Time: " << serial_time.count() << " seconds" << std::endl;

    sum = 0; // 重置sum
    start = std::chrono::high_resolution_clock::now();

    // 使用OpenMP并行計算
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += nums[i];
    }

    end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> parallel_time = end - start;

    std::cout << "Parallel sum: " << sum << " - Time: " << parallel_time.count() << " seconds" << std::endl;
    std::cout << "Speedup: " << serial_time.count() / parallel_time.count() << std::endl;
    return 0;
}

Serial sum: 100000000 - Time: 0.234739 seconds
Parallel sum: 100000000 - Time: 0.0501435 seconds
Speedup: 4.68134

open mp這種并行編程模型，提供了更高級的并行抽象，也更容易改造基于for循環(huán)的串行代碼。

5. 小結(jié)

Faiss不同索引的對比：
1. 不在意內(nèi)存：選擇HNSW；
2. 不在意時間：XXXFlat；
3. 在意內(nèi)存，在意時間，不要在意精度： PQ/IVFFALT

參考：

hsnw ： https://www.pinecone.io/learn/series/faiss/hnsw/

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

Faiss源碼解析

Faiss源碼解析

源碼編譯

源碼閱讀

1. Flat暴力求解

2. IVFFlat

3. IVFPQ

4. HNSW

4. faiss中關(guān)于計算效率的優(yōu)化

5. 小結(jié)

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

Faiss源碼解析

源碼編譯

源碼閱讀

1. Flat暴力求解

2. IVFFlat

3. IVFPQ

4. HNSW

4. faiss中關(guān)于計算效率的優(yōu)化

5. 小結(jié)

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av