Deploying Qwen2.5 and Running Inference on Multiple Nodes and GPUs with vLLM

Original article: https://alphahinex.github.io/2024/12/22/vllm-multi-node-inference/

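description: "This post records the process of deploying the Qwen2.5-32B-Instruct-GPTQ-Int4 model with vLLM on two machines, each with a single Tesla T4 GPU, along with the problems encountered, as a reference for multi-node, multi-GPU inference with vLLM in similar environments."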
date: 2024.12.22 10:26
categories:
- AI
tags: [AI, Python, vLLM]
keywords: vllm, gptq, gptq_marlin, tensor-parallel-size, Qwen2.5-32B-Instruct-GPTQ-Int4, multi-node inference, docker, nvidia container toolkit, max-model-len, gpu-memory-utilization, tesla t4

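This post records the process of deploying the Qwen2.5-32B-Instruct-GPTQ-Int4 model with vLLM on two machines, each with a single Tesla T4 GPU, along with the problems encountered, as a reference for multi-node, multi-GPU inference with vLLM in similar environments.

Deployment checklist

Qwen2.5-32B-Instruct-GPTQ-Int4, vLLM
docker v27.4.0, nvidia-container-toolkit v1.17.3
Tesla T4 GPU driver v550.127.08, CUDA 12.4

Preparing the deployment packages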

# qwen
$ git clone https://www.modelscope.cn/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4.git

# vllm image
$ docker pull vllm/vllm-openai:v0.6.4.post1

# export
$ docker save vllm/vllm-openai:v0.6.4.post1 | gzip > images.tar.gz

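Updating the GPU driver

The driver must be updated to a version that supports CUDA >= 12.4 in order to run the vLLM container.

# First uninstall the previously installed driver
$ sh ./NVIDIA-Linux-x86_64-550.127.08.run --uninstall
# Then install the new driver
$ sh ./NVIDIA-Linux-x86_64-550.127.08.run
# Check the driver
$ nvidia-smi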
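
The driver check can also be scripted. The following is a minimal sketch, assuming the banner printed by nvidia-smi contains a "CUDA Version: X.Y" field; the sample string and the helper names are illustrative and not part of the original setup.

```python
import re
from typing import Optional, Tuple


def parse_cuda_version(smi_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field, if present."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports(smi_output: str, required: Tuple[int, int] = (12, 4)) -> bool:
    """Return True when the reported CUDA version is at least `required`."""
    version = parse_cuda_version(smi_output)
    return version is not None and version >= required


# Illustrative banner line; in practice the text would come from running nvidia-smi.
sample = "| NVIDIA-SMI 550.127.08   Driver Version: 550.127.08   CUDA Version: 12.4 |"
print(driver_supports(sample))  # True
```

On a real host the input could be captured with subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout.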

Docker

Docker Engine

$ tar -xzf docker-27.4.0.tgz
$ cp docker/* /usr/local/bin/
$ docker -v

Save the contents of https://github.com/containerd/containerd/blob/main/containerd.service to /usr/lib/systemd/system/containerd.service:

# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target dbus.service

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd

Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity

# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

$ systemctl enable --now containerd
$ systemctl status containerd

Save the following to /usr/lib/systemd/system/docker.service:

[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target

[Service]
Type=notify
ExecStart=/usr/local/bin/dockerd
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutStartSec=0
RestartSec=2
Restart=always
StartLimitBurst=3
StartLimitInterval=60s
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Delegate=yes
KillMode=process
OOMScoreAdjust=-500

[Install]
WantedBy=multi-user.target

$ systemctl enable --now docker
$ systemctl status docker

Nvidia Container Toolkit

$ tar -xzf nvidia-container-toolkit_1.17.3_rpm_x86_64.tar.gz
$ cd release-v1.17.3-stable/packages/centos7/x86_64
$ rpm -i libnvidia-container1-1.17.3-1.x86_64.rpm
$ rpm -i libnvidia-container-tools-1.17.3-1.x86_64.rpm
$ rpm -i nvidia-container-toolkit-base-1.17.3-1.x86_64.rpm
$ rpm -i nvidia-container-toolkit-1.17.3-1.x86_64.rpm 
# Check the installation
$ nvidia-ctk -h
# Configure the NVIDIA Container Runtime
$ nvidia-ctk runtime configure --runtime=docker
# Check the configuration
$ cat /etc/docker/daemon.json
# Restart docker
$ systemctl restart docker
# After the restart, run the following to confirm the nvidia runtime is registered:
$ docker info | grep Runtimes
 Runtimes: io.containerd.runc.v2 nvidia runc
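
To automate this check, the Runtimes line can be parsed. This is a small sketch that assumes the line has the "Runtimes: a b c" shape shown above; parse_runtimes is a hypothetical helper name.

```python
from typing import List


def parse_runtimes(docker_info_output: str) -> List[str]:
    """Return the runtime names listed on the 'Runtimes:' line of `docker info`."""
    for line in docker_info_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Runtimes:"):
            return stripped[len("Runtimes:"):].split()
    return []


# The sample is the output line shown above.
sample = " Runtimes: io.containerd.runc.v2 nvidia runc"
print("nvidia" in parse_runtimes(sample))  # True
```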

Qwen

1. Verify the model files

942d93a82fb6d0cb27c940329db971c1e55da78aed959b7a9ac23944363e8f47  model-00001-of-00005.safetensors
19139f34508cb30b78868db0f19ed23dbc9f248f1c5688e29000ed19b29a7eef  model-00002-of-00005.safetensors
d0f829efe1693dddaa4c6e42e867603f19d9cc71806df6e12b56cc3567927169  model-00003-of-00005.safetensors
3a5a428f449bc9eaf210f8c250bc48f3edeae027c4ef8ae48dd4f80e744dd19e  model-00004-of-00005.safetensors
c22a1d1079136e40e1d445dda1de9e3fe5bd5d3b08357c2eb052c5b71bf871fe  model-00005-of-00005.safetensors

$ cd /root/model/Qwen2.5-32B-Instruct-GPTQ-Int4
$ sha256sum *.safetensors > sum.txt
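
Comparing five hashes by eye is error-prone, so the comparison can be scripted. The sketch below assumes the reference checksums are available as text in the "HASH  FILENAME" layout that sha256sum writes; the function names are illustrative.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte shards are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sum_file(text: str) -> Dict[str, str]:
    """Parse 'HASH  FILENAME' lines into a {filename: hash} mapping."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            expected[parts[1]] = parts[0]
    return expected


def mismatched_files(model_dir: Path, expected: Dict[str, str]) -> List[str]:
    """Return the names of files that are missing or whose hash differs."""
    bad = []
    for name, want in sorted(expected.items()):
        target = model_dir / name
        if not target.is_file() or sha256_of(target) != want:
            bad.append(name)
    return bad
```

An empty list from mismatched_files(Path("/root/model/Qwen2.5-32B-Instruct-GPTQ-Int4"), parse_sum_file(reference_text)) means every shard matches the reference list.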

2. Configure the cluster

With the vllm/vllm-openai:v0.6.4.post1 image available on both machines, save https://github.com/vllm-project/vllm/blob/main/examples/run_cluster.sh to /root/model/:

#!/bin/bash

# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi

# Assign the first four arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3"  # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4

# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")

# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Define a function to cleanup on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT

# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

# Run the docker command with the user specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"

Use node 1 as the head node and node 2 as the worker node.

On node 1, run:

nohup bash run_cluster.sh \
    vllm/vllm-openai:v0.6.4.post1 \
    IP_OF_HEAD_NODE \
    --head \
    /root/model > nohup.log 2>&1 &

On node 2, run:

nohup bash run_cluster.sh \
    vllm/vllm-openai:v0.6.4.post1 \
    IP_OF_HEAD_NODE \
    --worker \
    /root/model > nohup.log 2>&1 &

Note: the script is given the head node's IP on both nodes.
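
The reason both nodes receive the head node's IP becomes clearer when the branch in run_cluster.sh is written out on its own. The sketch below mirrors that branch in Python; ray_start_command is a hypothetical helper and 192.0.2.10 is a placeholder documentation address, not a value from the original setup.

```python
def ray_start_command(node_type: str, head_ip: str, port: int = 6379) -> str:
    """Build the `ray start` command the same way run_cluster.sh does."""
    if node_type not in ("--head", "--worker"):
        raise ValueError("node type must be --head or --worker")
    command = "ray start --block"
    if node_type == "--head":
        # The head node only opens the port; it does not need its own address.
        command += f" --head --port={port}"
    else:
        # A worker joins the cluster by dialing the head node's address.
        command += f" --address={head_ip}:{port}"
    return command


print(ray_start_command("--head", "192.0.2.10"))
print(ray_start_command("--worker", "192.0.2.10"))
```

Only the worker's command actually uses the address, which is why passing the worker's own IP would leave it unable to find the cluster.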

On either node, enter the container with docker exec -ti node bash:

# Check the cluster status
$ ray status

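3. Start the vLLM service

Start the service inside the container on node 1. With this GPU configuration and 90% GPU memory utilization, the model's original 32k context length has to be cut to 4k:

# Set the total tensor-parallel-size from 2 nodes with 1 GPU each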
$ nohup vllm serve /root/.cache/huggingface/Qwen2.5-32B-Instruct-GPTQ-Int4 \
    --served-model-name Qwen2.5-32B-Instruct-GPTQ-Int4 \
    --tensor-parallel-size 2 --max-model-len 4096 \
    > vllm_serve_qwen_nohup.log 2>&1 &

Parameter tuning process

With the default gpu-memory-utilization (0.9), the log reports # GPU blocks: 0, followed by this error:

No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.

This was addressed by adding --gpu-memory-utilization 0.95.

After raising gpu-memory-utilization to 0.95, the log shows # GPU blocks: 271, and 271 * 16 = 4336, which is the KV cache token count in the next error:

The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (4336). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

This was addressed by adding --max-model-len 4096.

After adding --max-model-len 4096, the log shows # GPU blocks: 1548.
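
The block arithmetic above can be captured in a few lines. This is a sketch that assumes 16 tokens per KV cache block, as implied by the 271 * 16 = 4336 calculation; the helper names are illustrative.

```python
BLOCK_SIZE = 16  # tokens per KV cache block, taken from the 271 * 16 = 4336 arithmetic


def kv_cache_tokens(num_gpu_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of tokens the KV cache can hold for a given block count."""
    return num_gpu_blocks * block_size


def fits(max_model_len: int, num_gpu_blocks: int, block_size: int = BLOCK_SIZE) -> bool:
    """True when a full-length sequence fits in the KV cache."""
    return max_model_len <= kv_cache_tokens(num_gpu_blocks, block_size)


print(kv_cache_tokens(271))   # 4336
print(fits(32768, 271))       # False: the 32k default context is rejected
print(fits(4096, 271))        # True: a 4k context fits
print(kv_cache_tokens(1548))  # 24768
```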

4. Verify the chat endpoint

curl --request POST \
  -H "Content-Type: application/json" \
  --url http://IP_OF_HEAD_NODE:8000/v1/chat/completions \
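  --data '{"messages":[{"role":"user","content":"I want you to act as an IT expert. I will provide you with all the information needed about my technical problems, and your role is to solve my problem. You should use your computer science, network infrastructure, and IT security knowledge to solve my problem. Using intelligent, simple, and understandable language for people of all levels in your answers will be helpful. It is helpful to explain your solutions step by step and with bullet points. Try to avoid too many technical details, but use them when necessary. I want you to reply with the solution, not write any explanations. My first problem is “my laptop gets an error with a blue screen”."}],"stream":true,"model":"Qwen2.5-32B-Instruct-GPTQ-Int4"}'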

The Content-Type request header must be set; otherwise the server returns a 500 error. See the vLLM issue "[Bug]: Missing Content Type returns 500 Internal Server Error instead of 415 Unsupported Media Type".
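
The same request can be issued from Python with only the standard library. This is a sketch, assuming the server streams OpenAI-style "data: {...}" lines terminated by "data: [DONE]"; build_chat_request and iter_stream_text are hypothetical helper names.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a streaming chat request with the Content-Type header set explicitly."""
    payload = {
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragments carried by the streamed 'data:' lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break
        choices = json.loads(body).get("choices") or []
        if not choices:
            continue
        piece = choices[0].get("delta", {}).get("content")
        if piece:
            yield piece
```

With a running server, the response object returned by urllib.request.urlopen(build_chat_request("http://IP_OF_HEAD_NODE:8000", "Qwen2.5-32B-Instruct-GPTQ-Int4", "hello")) can be passed directly to iter_stream_text, since it iterates line by line.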

Replies consist only of "!"

we currently find two workarounds

use gptq_marlin, which is available for Ampere and later cards.

change the number on this line from 50 to 0 and install from the modified source code. it may affect speed on short sequences though.
Source: https://github.com/QwenLM/Qwen2.5/issues/1103#issuecomment-2507022590

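Similar problems have been reported to the project developers in both the Qwen and vLLM communities, and jklj077 has offered two temporary workarounds:

Edit config.json in the model files, changing "quant_method": "gptq" to "quant_method": "gptq_marlin". This requires a GPU with compute capability 8.0 or higher, which rules it out for the Tesla T4 (compute capability 7.5);
Modify the vLLM source code and install from the modified source.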
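
For hardware that does meet the requirement, the first workaround can be scripted. The sketch below assumes quant_method sits under a "quantization_config" object in config.json, which is an assumption about the file layout rather than something stated above; both helper names are illustrative.

```python
import json
from pathlib import Path
from typing import Tuple


def marlin_supported(compute_capability: Tuple[int, int]) -> bool:
    """gptq_marlin needs compute capability 8.0 or higher."""
    return compute_capability >= (8, 0)


def set_quant_method(config_path: Path, method: str = "gptq_marlin") -> str:
    """Rewrite quant_method in config.json and return the previous value."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: the field is nested under "quantization_config".
    section = config.get("quantization_config")
    if not isinstance(section, dict) or "quant_method" not in section:
        raise KeyError("quantization_config.quant_method not found")
    previous = section["quant_method"]
    section["quant_method"] = method
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return previous


print(marlin_supported((7, 5)))  # False: a Tesla T4 is compute capability 7.5
```

Backing up config.json before calling set_quant_method makes the change easy to revert.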

5. Verify the completions endpoint

curl --request POST \
  -H "Content-Type: application/json" \
  --url http://IP_OF_HEAD_NODE:8000/v1/completions \
  --data '{"prompt":"who r u?","model":"Qwen2.5-32B-Instruct-GPTQ-Int4"}'

References

NVIDIA GPU driver installation
Offline installation of Docker 24 on CentOS 7.9 (pitfall-free edition) - CSDN blog
Building an offline GitHub Copilot with PaddleNLP and CodeGen - Alpha Hinex's Blog
[Usage]: vllm infer with 2 * Nvidia-L20, output repeat !!!!
[Bug]: Qwen2.5-32B-GPTQ-Int4 inference !!!!!
Distributed Inference and Serving
vLLM - Multi-Node Inference and Serving
LLM inference: local multi-node, multi-GPU distributed deployment with vLLM - CSDN blog
Pitfalls of deploying vLLM with Docker across multiple GPUs | LittleFish’Blog
