A Step-by-Step Guide to Image Search by Image with Elasticsearch 8.16.0

I have wanted to build image-based image search on Elasticsearch since last year, but it only came together this year. This post writes up the whole implementation and the pitfalls I hit along the way, as a reference for readers working with this or later versions.

1. The Overall Process

1.1 Prerequisites

I followed this blog post: https://www.elastic.co/search-labs/blog/implement-image-similarity-search-elastic

First, make sure the following base software is installed on your machine:

  • Git
  • Python 3.9+
  • Pycharm
  • Elasticsearch
  • Kibana
  • The HuggingFace clip-ViT-B-32-multilingual-v1 model

1.2 Overall Steps

First you need a large number of images to build up a base library. If you don't have any, you can write a small Python crawler to scrape some; failing that, just download a ready-made dataset.

Start from the left of the architecture diagram (Images, Documents, Audio): this data passes through the "Transform into embedding" step, which converts it into vectors that are stored in the nearest-neighbor store, i.e. Elasticsearch. Once that step is done, you are already more than halfway there. What remains is the retrieval program on the right: it first converts your input image or text into a vector, then Elasticsearch computes the cosine similarity between that vector and the stored image vectors, ranks the images by score from high to low, and returns the top matches. With that, the image-search-by-image feature is complete.
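The ranking step described above boils down to cosine similarity between the query vector and each stored image vector. Here is a minimal, self-contained sketch of that idea, using made-up 4-dimensional vectors in place of real 512-dimensional CLIP embeddings (the filenames and values are purely illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the L2 norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional "embeddings" (real CLIP vectors have 512 dims)
query = np.array([1.0, 0.0, 1.0, 0.0])
images = {
    "dog.jpg": np.array([0.9, 0.1, 0.8, 0.0]),
    "cat.jpg": np.array([0.0, 1.0, 0.0, 1.0]),
}

# Rank images by similarity to the query, highest score first
ranked = sorted(images.items(),
                key=lambda kv: cosine_similarity(query, kv[1]),
                reverse=True)
print(ranked[0][0])  # prints dog.jpg
```

Elasticsearch does exactly this comparison at scale with an approximate nearest-neighbor index, instead of the brute-force loop above.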

Throughout this process Kibana needs the 30-day trial of the machine learning features enabled. Note that you must finish everything within those 30 days; after that, the feature requires a paid license.
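If you prefer not to click through the Kibana UI, the same 30-day trial can be started with the license API, e.g. as a request in Kibana Dev Tools:

```
POST /_license/start_trial?acknowledge=true
```

You can confirm the trial is active afterwards with `GET /_license`.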


2. Implementation

2.1 Install Elasticsearch and Kibana

I won't go into much detail on these two; installation is quite simple. In 7.x you had to configure Elasticsearch's CA certificate in Kibana by hand; in 8.x the enrollment token and verification code remove that step. The whole flow is: start the elasticsearch binary first and copy the generated password and enrollment token from its console output; then start kibana, open port 5601 in a local browser, and paste in the token; finally, find the verification code in the Kibana console output and enter it on the Kibana page. With that, Elasticsearch and Kibana are installed.

Then log in on the Kibana page with the username elastic and the password you copied from the Elasticsearch console, and you're in.
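For reference, if you missed the token or password in the console output, they can be regenerated with the standard 8.x command-line helpers (run from each product's installation directory):

```
# On the Elasticsearch side: create a new enrollment token for Kibana
bin/elasticsearch-create-enrollment-token --scope kibana

# Reset the elastic user's password if you lost the generated one
bin/elasticsearch-reset-password -u elastic

# On the Kibana side: print the six-digit verification code again
bin/kibana-verification-code
```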

If anything goes wrong in the steps above, first check that you are running 8.x; 7.x is not supported here. For other issues, a search engine is your friend, so I won't belabor this further.

2.2 Clone the flask-elastic-image-search Code

Enter the following commands in a terminal:

$ git clone https://github.com/radoondas/flask-elastic-image-search.git
$ cd flask-elastic-image-search

Create a virtual environment in PyCharm, or use conda if you prefer.

The contents of requirements.txt:

asttokens==3.0.0
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
colorama==0.4.6
contourpy==1.3.0
cycler==0.12.1
decorator==5.1.1
eland==8.16.0
elastic-transport==8.15.1
elasticsearch==8.16.0
exceptiongroup==1.2.2
executing==2.1.0
exif==1.5.0
filelock==3.16.1
Flask==2.0.2
Flask-WTF==1.0.1
fonttools==4.55.0
fsspec==2024.10.0
huggingface-hub==0.26.3
idna==3.10
importlib_resources==6.4.5
ipython==8.18.1
itsdangerous==2.2.0
jedi==0.19.2
Jinja2==3.1.4
joblib==1.4.2
kiwisolver==1.4.7
MarkupSafe==3.0.2
matplotlib==3.9.3
matplotlib-inline==0.1.7
mpmath==1.3.0
networkx==3.2.1
nltk==3.9.1
numpy==1.26.4
packaging==24.2
pandas==1.5.3
parso==0.8.4
pathlib==1.0.1
Pillow==9.3.0
plum-py==0.8.7
prompt_toolkit==3.0.48
pure_eval==0.2.3
Pygments==2.18.0
pyparsing==3.2.0
python-dateutil==2.9.0.post0
python-dotenv==0.21.1
pytz==2024.2
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.13.1
sentence-transformers==3.3.1
sentencepiece==0.2.0
six==1.16.0
stack-data==0.6.3
sympy==1.13.1
threadpoolctl==3.5.0
tokenizers==0.20.3
torch==2.5.0
torchvision==0.20.0
tqdm==4.64.1
traitlets==5.14.3
transformers==4.46.3
typing_extensions==4.12.2
urllib3==2.2.3
wcwidth==0.2.13
Werkzeug==2.2.2
WTForms==3.0.1
zipp==3.21.0
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt

2.3 Download the Model

The model to download is clip-ViT-B-32-multilingual-v1 on the Hugging Face hub. By my count there are roughly four or five ways to download it. Let's assume first that you cannot reach huggingface.co directly; even if you can (e.g. through a proxy), this step rarely works on the first attempt.

2.3.1 If you can reach huggingface.co directly, skip to 2.4

2.3.2 You have a proxy, but step 2.4 still fails to connect to Hugging Face

這是因?yàn)槟愕目茖W(xué)上網(wǎng)方案不行,這就沒(méi)有辦法了,只能手動(dòng)下載模型咯,

問(wèn)題
因業(yè)務(wù)需要在本機(jī)測(cè)試embedding分詞模型,使用 huggingface上的transformers 加載模型時(shí),因?yàn)榫W(wǎng)絡(luò)無(wú)法訪問(wèn),不能從 huggingface 平臺(tái)下載模型并加載出現(xiàn)如下錯(cuò)誤。 下面提供幾種模型下載辦法

解決
有三種方式下載模型,一種是通過(guò) huggingface model hub 的按鈕下載,一種是使用 huggingface 的 transformers 庫(kù)實(shí)例化模型進(jìn)而將模型下載到緩存目錄(上述報(bào)錯(cuò)就是這種),另一種是通過(guò) huggingface 的 huggingface_hub 工具進(jìn)行下載。下面介紹兩種方式:

2.3.3 Download via the Hugging Face buttons

Click the download button on the model page and save all the files into one directory.

If the network blocks the download, use the mirror site [HF-Mirror](https://hf-mirror.com/) instead.

2.3.4 The huggingface_hub tool (recommended)

  • Install huggingface_hub

    python -m pip install huggingface_hub
    
  • Download with the snapshot_download function from huggingface_hub

    from huggingface_hub import snapshot_download
    snapshot_download(repo_id="BAAI/bge-m3")
    
  • Or use the command-line tool that ships with huggingface_hub (recommended)

    huggingface-cli download BAAI/bge-m3
    

    If the download feels slow, you can speed it up through the Hugging Face mirror by setting the HF_ENDPOINT environment variable, which switches the download endpoint.

  • Set the environment variable

# Linux
export HF_ENDPOINT=https://hf-mirror.com
# Windows (PowerShell)
$env:HF_ENDPOINT = "https://hf-mirror.com"
  • Download the model
huggingface-cli download BAAI/bge-m3

Note: on Windows, run the terminal as administrator.


For more huggingface_hub usage, see its documentation section "Download an entire repository".

2.3.5 hf_transfer

You can also use hf_transfer to speed downloads up further. I didn't use it here, so I won't go into detail.

Download files from the Hub

  • Install hf_transfer
pip install hf_transfer
  • Set the environment variable
export HF_HUB_ENABLE_HF_TRANSFER=1
  • Download the model
huggingface-cli download internlm/internlm2-chat-7b

Here is where my downloaded model ended up:

C:\Users\26314\.cache\huggingface\hub\models--sentence-transformers--clip-ViT-B-32-multilingual-v1\snapshots\58edf8cada9e398793dca955574a48cbb7f18be2


2.4 Download the Dataset

http://sbert.net/datasets/unsplash-25k-photos.zip

Use the model and image dataset you downloaded to run the test program below; if it runs to completion, everything is in place.

from sentence_transformers import SentenceTransformer, util
from PIL import Image
import glob
import torch
import pickle
import zipfile
from IPython.display import display
from IPython.display import Image as IPImage
import os
from tqdm.autonotebook import tqdm

# Here we load the multilingual CLIP model. Note, this model can only encode text.
# If you need embeddings for images, you must load the 'clip-ViT-B-32' model
model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')

# Next, we get about 25k images from Unsplash
img_folder = 'photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)

    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):  # Download dataset if does not exist
        util.http_get('http://sbert.net/datasets/' + photo_filename, photo_filename)

    # Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)

# Now, we need to compute the embeddings
# To speed things up, we distribute pre-computed embeddings
# Otherwise you can also encode the images yourself.
# To encode an image, you can use the following code:
# from PIL import Image
# img_emb = model.encode(Image.open(filepath))

use_precomputed_embeddings = True

if use_precomputed_embeddings:
    emb_filename = 'unsplash-25k-photos-embeddings.pkl'
    if not os.path.exists(emb_filename):  # Download dataset if does not exist
        util.http_get('http://sbert.net/datasets/' + emb_filename, emb_filename)

    with open(emb_filename, 'rb') as fIn:
        img_names, img_emb = pickle.load(fIn)
    print("Images:", len(img_names))
else:
    # For embedding images, we need the non-multilingual CLIP model
    img_model = SentenceTransformer('clip-ViT-B-32')

    img_names = list(glob.glob('unsplash/photos/*.jpg'))
    print("Images:", len(img_names))
    img_emb = img_model.encode([Image.open(filepath) for filepath in img_names], batch_size=128, convert_to_tensor=True,
                               show_progress_bar=True)


# Next, we define a search function.
def search(query, k=3):
    # First, we encode the query (which can either be an image or a text string)
    query_emb = model.encode([query], convert_to_tensor=True, show_progress_bar=False)

    # Then, we use the util.semantic_search function, which computes the cosine-similarity
    # between the query embedding and all image embeddings.
    # It then returns the top_k highest ranked images, which we output
    hits = util.semantic_search(query_emb, img_emb, top_k=k)[0]

    print("Query:")
    display(query)
    for hit in hits:
        print(img_names[hit['corpus_id']])
        display(IPImage(os.path.join(img_folder, img_names[hit['corpus_id']]), width=200))

search("Two dogs playing in the snow")

#German: A cat on a chair
search("Eine Katze auf einem Stuhl")

#Spanish: Many fish
search("Muchos peces")

#Chinese: A beach with palm trees
search("棕櫚樹的沙灘")

2.5 Run create-image-embeddings.py

Before running, update the Elasticsearch username, password, and CA certificate in the code below.

import os
import sys
import glob
import time
import json
import argparse
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch, SSLError
from elasticsearch.helpers import parallel_bulk
from PIL import Image
from tqdm import tqdm
from datetime import datetime
from exif import Image as exifImage

ES_HOST = "https://127.0.0.1:9200/"
ES_USER = "elastic"
ES_PASSWORD = "xB9OzFwRC9-NW4-Ypknf"
ES_TIMEOUT = 3600

DEST_INDEX = "my-image-embeddings"
DELETE_EXISTING = True
CHUNK_SIZE = 100

PATH_TO_IMAGES = "../app/static/photos/**/*.jp*g"
PREFIX = "..\\app\\static\\photos\\"

CA_CERT='../app/conf/ess-cloud.cer'

parser = argparse.ArgumentParser()
parser.add_argument('--es_host', dest='es_host', required=False, default=ES_HOST,
                    help="Elasticsearch hostname. Must include HOST and PORT. Default: " + ES_HOST)
parser.add_argument('--es_user', dest='es_user', required=False, default=ES_USER,
                    help="Elasticsearch username. Default: " + ES_USER)
parser.add_argument('--es_password', dest='es_password', required=False, default=ES_PASSWORD,
                    help="Elasticsearch password. Default: " + ES_PASSWORD)
parser.add_argument('--verify_certs', dest='verify_certs', required=False, default=True,
                    action=argparse.BooleanOptionalAction,
                    help="Verify certificates. Default: True")
parser.add_argument('--thread_count', dest='thread_count', required=False, default=4, type=int,
                    help="Number of indexing threads. Default: 4")
parser.add_argument('--chunk_size', dest='chunk_size', required=False, default=CHUNK_SIZE, type=int,
                    help="Default: " + str(CHUNK_SIZE))
parser.add_argument('--timeout', dest='timeout', required=False, default=ES_TIMEOUT, type=int,
                    help="Request timeout in seconds. Default: " + str(ES_TIMEOUT))
parser.add_argument('--delete_existing', dest='delete_existing', required=False, default=True,
                    action=argparse.BooleanOptionalAction,
                    help="Delete existing indices if they are present in the cluster. Default: True")
parser.add_argument('--ca_certs', dest='ca_certs', required=False,# default=CA_CERT,
                    help="Path to CA certificate.") # Default: ../app/conf/ess-cloud.cer")
parser.add_argument('--extract_GPS_location', dest='gps_location', required=False, default=False,
                    action=argparse.BooleanOptionalAction,
                    help="[Experimental] Extract GPS location from photos if available. Default: False")

args = parser.parse_args()


def main():
    global args
    lst = []

    start_time = time.perf_counter()
    img_model = SentenceTransformer('clip-ViT-B-32')
    duration = time.perf_counter() - start_time
    print(f'Duration load model = {duration}')

    filenames = glob.glob(PATH_TO_IMAGES, recursive=True)
    start_time = time.perf_counter()
    for filename in tqdm(filenames, desc='Processing files', total=len(filenames)):
        image = Image.open(filename)

        doc = {}
        embedding = image_embedding(image, img_model)
        doc['image_id'] = create_image_id(filename)
        doc['image_name'] = os.path.basename(filename)
        doc['image_embedding'] = embedding.tolist()
        doc['relative_path'] = os.path.relpath(filename).split(PREFIX)[1]
        doc['exif'] = {}

        try:
            date = get_exif_date(filename)
            # print(date)
            doc['exif']['date'] = get_exif_date(filename)
        except Exception as e:
            pass

        # Experimental! Extract photo GPS location if available.
        if args.gps_location:
            try:
                doc['exif']['location'] = get_exif_location(filename)
            except Exception as e:
                pass

        lst.append(doc)

    duration = time.perf_counter() - start_time
    print(f'Duration creating image embeddings = {duration}')

    es = Elasticsearch(hosts=ES_HOST)
    if args.ca_certs:
        es = Elasticsearch(
            hosts=[args.es_host],
            verify_certs=args.verify_certs,
            basic_auth=(args.es_user, args.es_password),
            ca_certs=args.ca_certs
        )
    else:
        es = Elasticsearch(
            hosts=[args.es_host],
            verify_certs=args.verify_certs,
            basic_auth=(args.es_user, args.es_password)
        )

    # options() returns a new client instance; keep it so the timeout applies
    es = es.options(request_timeout=args.timeout)

    # index name to index data into
    index = DEST_INDEX
    try:
        with open("image-embeddings-mappings.json", "r") as config_file:
            config = json.loads(config_file.read())
            if args.delete_existing:
                if es.indices.exists(index=index):
                    print("Deleting existing %s" % index)
                    es.indices.delete(index=index, ignore=[400, 404])

            print("Creating index %s" % index)
            es.indices.create(index=index,
                              mappings=config["mappings"],
                              settings=config["settings"],
                              ignore=[400, 404],
                              request_timeout=args.timeout)


        count = 0
        for success, info in parallel_bulk(
                client=es,
                actions=lst,
                thread_count=4,
                chunk_size=args.chunk_size,
                timeout='%ss' % 120,
                index=index
        ):
            if success:
                count += 1
                if count % args.chunk_size == 0:
                    print('Indexed %s documents' % str(count), flush=True)
                    sys.stdout.flush()
            else:
                print('Doc failed', info)

        print('Indexed %s documents' % str(count), flush=True)
        duration = time.perf_counter() - start_time
        print(f'Total duration = {duration}')
        print("Done!\n")
    except SSLError as e:
        if "SSL: CERTIFICATE_VERIFY_FAILED" in e.message:
            print("\nCERTIFICATE_VERIFY_FAILED exception. Please check the CA path configuration for the script.\n")
            raise
        else:
            raise


def image_embedding(image, model):
    return model.encode(image)


def create_image_id(filename):
    # print("Image filename: ", filename)
    return os.path.splitext(os.path.basename(filename))[0]

def get_exif_date(filename):
    with open(filename, 'rb') as f:
        image = exifImage(f)
        taken = f"{image.datetime_original}"
        date_object = datetime.strptime(taken, "%Y:%m:%d %H:%M:%S")
        prettyDate = date_object.isoformat()
        return prettyDate

def get_exif_location(filename):
    with open(filename, 'rb') as f:
        image = exifImage(f)
        exif = {}
        lat = dms_coordinates_to_dd_coordinates(image.gps_latitude, image.gps_latitude_ref)
        lon = dms_coordinates_to_dd_coordinates(image.gps_longitude, image.gps_longitude_ref)
        return [lon, lat]


def dms_coordinates_to_dd_coordinates(coordinates, coordinates_ref):
    decimal_degrees = coordinates[0] + \
                      coordinates[1] / 60 + \
                      coordinates[2] / 3600

    if coordinates_ref == "S" or coordinates_ref == "W":
        decimal_degrees = -decimal_degrees

    return decimal_degrees

if __name__ == '__main__':
    main()

The command to run it:

$ cd image_embeddings
$ python3 create-image-embeddings.py --es_host='https://127.0.0.1:9200' \
  --es_user='elastic' --es_password='changeme' \
  --ca_certs='../app/conf/ca.crt'

Once this finishes, all your images have been converted to vectors and stored in Elasticsearch: half of the whole job is done.
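To sanity-check what was indexed, you can run a kNN search against the new index from Kibana Dev Tools. The index name my-image-embeddings and the field name image_embedding come from the script above; the query_vector shown is a truncated placeholder, since in practice it must be the full 512-dimensional vector obtained by encoding a query image with clip-ViT-B-32:

```json
GET my-image-embeddings/_search
{
  "knn": {
    "field": "image_embedding",
    "query_vector": [0.12, -0.03, 0.91],
    "k": 5,
    "num_candidates": 100
  },
  "_source": ["image_id", "image_name", "relative_path"]
}
```

The hits come back ranked by similarity score, which is exactly what the search app does under the hood.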

2.6 Install the Model in Kibana

First enable the 30-day Platinum trial in Kibana so the machine learning features are available; only then run the install script below, otherwise you will hit unexpected errors.


There are two versions of the script below; use the latest, 8.16.0, because the 8.6.0 variant has problems and I never got it to install. As before, update the Elasticsearch username, password, and CA certificate first.

import elasticsearch
from pathlib import Path
from eland.common import es_version
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel

ca_certs_path = "../app/conf/ca.crt"
es = elasticsearch.Elasticsearch("https://elastic:xB9OzFwRC9-NW4-Ypknf@127.0.0.1:9200",
                                 ca_certs=ca_certs_path,
                                 verify_certs=True)
es_cluster_version = es_version(es)

# Load a Hugging Face transformers model directly from the model hub
tm = TransformerModel(model_id="sentence-transformers/clip-ViT-B-32-multilingual-v1", task_type="text_embedding", es_version=es_cluster_version)


# Export the model in a TorchScript representation which Elasticsearch uses
tmp_path = "models"
Path(tmp_path).mkdir(parents=True, exist_ok=True)
model_path, config, vocab_path = tm.save(tmp_path)

# Import model into Elasticsearch
ptm = PyTorchModel(es, tm.elasticsearch_model_id())
ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)

Running the program above uploads sentence-transformers/clip-ViT-B-32-multilingual-v1 into Elasticsearch; then start (deploy) it from the Trained Models page in Kibana with a click.
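Once the deployment is started, you can verify the text-embedding side directly from Dev Tools with the _infer API. The model id below is the one eland derives from the Hugging Face id (lowercased, with "/" replaced by "__"); double-check the exact id on the Trained Models page before running this:

```json
POST _ml/trained_models/sentence-transformers__clip-vit-b-32-multilingual-v1/_infer
{
  "docs": [
    { "text_field": "Two dogs playing in the snow" }
  ]
}
```

If everything is wired up, the response contains a predicted_value array: the text embedding that gets compared against the stored image vectors.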


2.7 Run the Search App
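The search front end is the Flask app in the repo's app/ directory. The exact steps are in the repo's README; assuming a standard Flask development setup, and that the app's configuration already points at your Elasticsearch instance, starting it looks roughly like:

```
cd flask-elastic-image-search/app
flask run --port=5001
```

Then open http://127.0.0.1:5001 in a browser (the port is an assumption; use whatever the README specifies).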


Finally, here is the end result in action.


Closing Remarks

Image search by image on Elasticsearch is a feature I had wanted to build for a long time; it took three days to get the whole pipeline running, so I wrote this post to share it. Thanks for reading; if you have any questions, leave them in the comments and I'll reply as soon as I see them. Finally, my complete code is at https://github.com/xuanyuanbao/flask-elastic-image-search for reference. Also, the PR I submitted on GitHub hasn't been merged yet; the author is probably busy, and I hope it goes through this time.
