
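Original post: https://alphahinex.github.io/2024/12/22/vllm-multi-node-inference/
description: "This post documents deploying the Qwen2.5-32B-Instruct-GPTQ-Int4 model with vLLM on two machines, each with a single Tesla T4 GPU, together with the problems encountered along the way. It is meant as a reference for multi-node, multi-GPU inference with vLLM in similar environments."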
date: 2024.12.22 10:26
categories:
- AI
tags: [AI, Python, vLLM]
keywords: vllm, gptq, gptq_marlin, tensor-parallel-size, Qwen2.5-32B-Instruct-GPTQ-Int4, multi-node inference, docker, nvidia container toolkit, max-model-len, gpu-memory-utilization, tesla t4
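This post documents deploying the Qwen2.5-32B-Instruct-GPTQ-Int4 model with vLLM on two machines, each with a single Tesla T4 GPU, together with the problems encountered along the way. It is meant as a reference for multi-node, multi-GPU inference with vLLM in similar environments.
Deployment checklist
- Qwen2.5-32B-Instruct-GPTQ-Int4, vLLM
- docker v27.4.0, nvidia-container-toolkit v1.17.3
- Tesla T4 GPU driver v550.127.08, CUDA 12.4
Preparing the deployment packages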
# qwen
$ git clone https://www.modelscope.cn/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4.git
# vllm image
$ docker pull vllm/vllm-openai:v0.6.4.post1
# export
$ docker save vllm/vllm-openai:v0.6.4.post1 | gzip > images.tar.gz
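The exported archive still has to be copied to each GPU machine and loaded there, since later steps expect the image to be present on both nodes. A minimal sketch of that step follows; `load_image_archive` is a hypothetical helper name, and it assumes the archive was produced with `docker save ... | gzip` as above.

```shell
#!/bin/bash
# load_image_archive ARCHIVE: decompress a gzip-compressed `docker save`
# archive and feed it to `docker load`, which reads a tar stream from stdin.
# Hypothetical helper, not part of the original post.
load_image_archive() {
  gunzip -c "$1" | docker load
}

# Usage on each node after copying the archive over:
#   load_image_archive images.tar.gz
#   docker images vllm/vllm-openai
```

Running `docker images` afterwards is a quick way to confirm that the `v0.6.4.post1` tag is available locally.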
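Updating the GPU driver
The driver must be updated to one that supports CUDA >= 12.4 in order to run the vLLM container.
# Uninstall the previously installed driver first
$ sh ./NVIDIA-Linux-x86_64-550.127.08.run --uninstall
# Then install the new driver
$ sh ./NVIDIA-Linux-x86_64-550.127.08.run
# Verify the driver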
$ nvidia-smi
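The CUDA >= 12.4 requirement can also be checked in a script rather than by reading the `nvidia-smi` banner. The sketch below is illustrative only: `version_ge` and `cuda_version_from_smi` are hypothetical helpers, the comparison relies on GNU `sort -V`, and the parser assumes the banner contains the text `CUDA Version: X.Y`.

```shell
#!/bin/bash
# version_ge A B: succeed when dotted version A >= B (hypothetical helper).
# Uses GNU sort's -V (version sort): if B sorts first, then A >= B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# cuda_version_from_smi: read nvidia-smi output on stdin and print the X.Y
# that follows "CUDA Version:" (assumed banner format).
cuda_version_from_smi() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage on a machine with the driver installed:
#   cuda="$(nvidia-smi | cuda_version_from_smi)"
#   version_ge "$cuda" 12.4 && echo "driver is new enough for the vLLM container"
```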
Docker
Docker Engine
$ tar -xzf docker-27.4.0.tgz
$ cp docker/* /usr/local/bin/
$ docker -v
Save the contents of https://github.com/containerd/containerd/blob/main/containerd.service to /usr/lib/systemd/system/containerd.service:
# Copyright The containerd Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target dbus.service
[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999
[Install]
WantedBy=multi-user.target
$ systemctl enable --now containerd
$ systemctl status containerd
Save the following to /usr/lib/systemd/system/docker.service:
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target
[Service]
Type=notify
ExecStart=/usr/local/bin/dockerd
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutStartSec=0
RestartSec=2
Restart=always
StartLimitBurst=3
StartLimitInterval=60s
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Delegate=yes
KillMode=process
OOMScoreAdjust=-500
[Install]
WantedBy=multi-user.target
$ systemctl enable --now docker
$ systemctl status docker
Nvidia Container Toolkit
$ tar -xzf nvidia-container-toolkit_1.17.3_rpm_x86_64.tar.gz
$ cd release-v1.17.3-stable/packages/centos7/x86_64
$ rpm -i libnvidia-container1-1.17.3-1.x86_64.rpm
$ rpm -i libnvidia-container-tools-1.17.3-1.x86_64.rpm
$ rpm -i nvidia-container-toolkit-base-1.17.3-1.x86_64.rpm
$ rpm -i nvidia-container-toolkit-1.17.3-1.x86_64.rpm
# Check the installation
$ nvidia-ctk -h
# Configure the NVIDIA Container Runtime
$ nvidia-ctk runtime configure --runtime=docker
# Check the configuration
$ cat /etc/docker/daemon.json
# Restart docker
$ systemctl restart docker
# After the restart, run the following command to confirm the result:
$ docker info | grep Runtimes
Runtimes: io.containerd.runc.v2 nvidia runc
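The same check can be scripted so that a missing `nvidia` runtime fails loudly. This is a small sketch under the assumption that `docker info` prints a `Runtimes:` line like the one above; `has_nvidia_runtime` is a hypothetical helper name.

```shell
#!/bin/bash
# has_nvidia_runtime: read `docker info` output on stdin and succeed only if
# the "Runtimes:" line lists "nvidia" as a whitespace-delimited entry.
# Hypothetical helper, not part of the original post.
has_nvidia_runtime() {
  grep -Eq 'Runtimes:.*[[:space:]]nvidia([[:space:]]|$)'
}

# Usage on a host where docker is running:
#   docker info 2>/dev/null | has_nvidia_runtime \
#     && echo "nvidia runtime registered" \
#     || echo "nvidia runtime missing, re-check /etc/docker/daemon.json"
```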
Qwen
1. Verify the model files
942d93a82fb6d0cb27c940329db971c1e55da78aed959b7a9ac23944363e8f47 model-00001-of-00005.safetensors
19139f34508cb30b78868db0f19ed23dbc9f248f1c5688e29000ed19b29a7eef model-00002-of-00005.safetensors
d0f829efe1693dddaa4c6e42e867603f19d9cc71806df6e12b56cc3567927169 model-00003-of-00005.safetensors
3a5a428f449bc9eaf210f8c250bc48f3edeae027c4ef8ae48dd4f80e744dd19e model-00004-of-00005.safetensors
c22a1d1079136e40e1d445dda1de9e3fe5bd5d3b08357c2eb052c5b71bf871fe model-00005-of-00005.safetensors
$ cd /root/model/Qwen2.5-32B-Instruct-GPTQ-Int4
$ sha256sum *.safetensors > sum.txt
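Instead of comparing `sum.txt` with the published hashes by eye, `sha256sum` can do the comparison itself. The sketch below assumes the expected hashes have been saved to a file in the format `sha256sum` prints (hash, two spaces, file name); `verify_model_files` and the file name `expected.sha256` are hypothetical.

```shell
#!/bin/bash
# verify_model_files DIR EXPECTED: check every file listed in EXPECTED
# against the files in DIR. EXPECTED is read on stdin so that a relative
# path still resolves from the caller's directory. Succeeds only if all
# hashes match; --quiet suppresses the per-file "OK" lines.
verify_model_files() {
  ( cd "$1" && sha256sum --check --quiet ) < "$2"
}

# Usage, assuming the published hashes were saved as expected.sha256:
#   verify_model_files /root/model/Qwen2.5-32B-Instruct-GPTQ-Int4 expected.sha256 \
#     && echo "all shards match"
```

A non-zero exit status means at least one shard is missing or corrupted and should be downloaded again.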
2. Configure the cluster
Once the vllm/vllm-openai:v0.6.4.post1 image is available on both machines, save https://github.com/vllm-project/vllm/blob/main/examples/run_cluster.sh to /root/model/:
#!/bin/bash
# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi
# Assign the first three arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3" # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4
# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")
# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi
# Define a function to cleanup on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT
# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi
# Run the docker command with the user specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
Use node 1 as the head node and node 2 as the worker node.
On node 1, run:
nohup bash run_cluster.sh \
vllm/vllm-openai:v0.6.4.post1 \
IP_OF_HEAD_NODE \
--head \
/root/model > nohup.log 2>&1 &
On node 2, run:
nohup bash run_cluster.sh \
vllm/vllm-openai:v0.6.4.post1 \
IP_OF_HEAD_NODE \
--worker \
/root/model > nohup.log 2>&1 &
Note: on both nodes, the IP passed to the script is the head node's IP.
On either node, enter the container with docker exec -ti node bash:
# Check the cluster status
$ ray status
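3. Start the vLLM service
Start the service inside the container on node 1 (with the current GPUs and a GPU memory utilization of 90%, the model's original 32k context length has to be cut down to 4k):
# Set tensor-parallel-size to the total GPU count: 2 nodes x 1 GPU per node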
$ nohup vllm serve /root/.cache/huggingface/Qwen2.5-32B-Instruct-GPTQ-Int4 \
--served-model-name Qwen2.5-32B-Instruct-GPTQ-Int4 \
--tensor-parallel-size 2 --max-model-len 4096 \
> vllm_serve_qwen_nohup.log 2>&1 &
Parameter tuning process
With the default gpu-memory-utilization (0.9), the log reports # GPU blocks as 0.
No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine. Tried next: --gpu-memory-utilization 0.95
After raising gpu-memory-utilization to 0.95, the log reports # GPU blocks: 271. At 16 tokens per block, 271 * 16 = 4336, which is the KV cache token count in the error below.
The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (4336). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. Tried next: --max-model-len 4096
After adding --max-model-len 4096, the log reports # GPU blocks: 1548.
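The arithmetic behind these log lines is easy to reproduce. The sketch below is an illustration, not vLLM code: the helper names are hypothetical, and the factor 16 is simply the tokens-per-block value implied by 271 * 16 = 4336 above.

```shell
#!/bin/bash
# Tokens per KV cache block; 16 is the factor used in the text.
BLOCK_SIZE=16

# kv_cache_tokens BLOCKS: number of tokens the KV cache can hold.
kv_cache_tokens() {
  echo $(( $1 * BLOCK_SIZE ))
}

# fits_in_kv_cache MAX_MODEL_LEN BLOCKS: succeed when a single sequence of
# MAX_MODEL_LEN tokens fits into the cache. Hypothetical helper mirroring
# the condition described by the "max seq len ... is larger than" error.
fits_in_kv_cache() {
  [ "$1" -le "$(kv_cache_tokens "$2")" ]
}

kv_cache_tokens 271        # prints 4336 (blocks reported at 0.95 utilization)
fits_in_kv_cache 32768 271 || echo "32768 does not fit in 271 blocks"
fits_in_kv_cache 4096 271 && echo "4096 fits in 271 blocks"
kv_cache_tokens 1548       # prints 24768 (blocks reported with --max-model-len 4096)
```

The 271 blocks hold only 4336 tokens, so 4096 is the largest round context length that fits.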
4. Verify the chat endpoint
curl --request POST \
-H "Content-Type: application/json" \
--url http://IP_OF_HEAD_NODE:8000/v1/chat/completions \
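--data '{"messages":[{"role":"user","content":"I want you to act as an IT expert. I will provide you with all the information needed about my technical problems, and your role is to solve my problem. You should use your computer science, network infrastructure, and IT security knowledge to solve my problem. Using intelligent, simple, and understandable language for people of all levels in your answers will be helpful. It is helpful to explain your solutions step by step with bullet points. Try to avoid too many technical details, but use them when necessary. I want you to reply with the solution, not write any explanations. My first problem is “my laptop gets an error with a blue screen”."}],"stream":true,"model":"Qwen2.5-32B-Instruct-GPTQ-Int4"}'
The Content-Type request header must be set; otherwise the server responds with a 500 error, see [Bug]: Missing Content Type returns 500 Internal Server Error instead of 415 Unsupported Media Type
Every reply consists only of !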
we currently find two workarounds
- use gptq_marlin, which is available for Ampere and later cards.
- change the number on this line from 50 to 0 and install from the modified source code. it may affect speed on short sequences though.
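Source: https://github.com/QwenLM/Qwen2.5/issues/1103#issuecomment-2507022590
Similar problems have been reported to the project developers in both the Qwen and vLLM communities. For now, jklj077 has offered two workarounds:
- Edit config.json in the model files, changing "quant_method": "gptq" to "quant_method": "gptq_marlin". This requires a GPU with compute capability 8.0 or higher (the Tesla T4 used here is 7.5, so this option does not apply).
- Modify the vLLM source code and install from the modified source.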
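For readers on Ampere or later cards, the first workaround amounts to a one-line edit of config.json. The sketch below is one way to apply it while keeping a backup; `use_gptq_marlin` is a hypothetical helper, and it assumes the key appears literally as `"quant_method": "gptq"` in the file.

```shell
#!/bin/bash
# use_gptq_marlin CONFIG: rewrite "quant_method": "gptq" to
# "quant_method": "gptq_marlin" in CONFIG, keeping the original as CONFIG.bak.
# The pattern includes the closing quote, so a file that already says
# "gptq_marlin" is left unchanged. Hypothetical helper; only useful on GPUs
# with compute capability 8.0 or higher.
use_gptq_marlin() {
  sed -i.bak 's/"quant_method": *"gptq"/"quant_method": "gptq_marlin"/' "$1"
}

# Usage with the model path from this post:
#   use_gptq_marlin /root/model/Qwen2.5-32B-Instruct-GPTQ-Int4/config.json
#   grep quant_method /root/model/Qwen2.5-32B-Instruct-GPTQ-Int4/config.json
```

Restoring the `.bak` file reverts the change if the server fails to start with the new setting.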
5. Verify the completions endpoint
curl --request POST \
-H "Content-Type: application/json" \
--url http://IP_OF_HEAD_NODE:8000/v1/completions \
--data '{"prompt":"who r u?","model":"Qwen2.5-32B-Instruct-GPTQ-Int4"}'
References
- NVIDIA GPU driver installation
- Offline installation of Docker 24 on CentOS 7.9 (pitfall-free edition) - CSDN blog
- Building an offline GitHub Copilot with PaddleNLP and CodeGen - Alpha Hinex's Blog
- [Usage]: vllm infer with 2 * Nvidia-L20, output repeat !!!!
- [Bug]: Qwen2.5-32B-GPTQ-Int4 inference !!!!!
- Distributed Inference and Serving
- vLLM - Multi-Node Inference and Serving
- LLM inference: local distributed multi-node, multi-GPU deployment with vLLM - CSDN blog
- Pitfalls of distributed multi-GPU Docker deployment with vLLM | LittleFish’Blog