Reference links
Ascend official deployment guide
https://docs.vllm.ai/projects/ascend/zh-cn/v0.13.0/tutorials/DeepSeek-V4.html
ModelScope community deployment guide
https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp/feedback
AtomGit AI deployment guide
https://ai.gitcode.com/Ascend-SACT/DeepSeek-V4-Flash
Server used for this installation
Huawei Atlas 900 RCK A2 (Ascend)
Memory: 1 TB
NPU: 8 × Ascend 910B2
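Before proceeding, it is worth confirming that the driver is healthy and all eight NPUs are visible on the host. A minimal check using the standard Ascend device tool:

# List devices and health status; expect eight 910B2 NPUs in an OK state
npu-smi info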
Download the patch on the host
Download the patch to the host first; when the Docker container is started later, mount the patch directory into it so the patch is applied automatically inside the container.
The goal of this patch is to enable DeepSeek-V4-Flash-w8a8-mtp on vllm-ascend to support the following options:
--tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 --enable-auto-tool-choice --reasoning-parser deepseek_v4
PATCH_DIR=/root/patches
mkdir -p "$PATCH_DIR"   # ensure the directory exists before cd
cd "$PATCH_DIR"
# Download the patch
git clone https://atomgit.com/Ascend-SACT/DeepSeek-V4-Flash.git
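Before launching the container, you can verify that the clone produced the patch file that will be applied inside the container (a minimal sketch; the file name matches the git apply step in the startup script below):

# Confirm the patch file is in place before mounting it read-only
ls -l "$PATCH_DIR/DeepSeek-V4-Flash"
test -f "$PATCH_DIR/DeepSeek-V4-Flash/deepseek-v4-agentic-support.patch" \
  && echo "patch file found" \
  || echo "patch file missing -- re-check the clone"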
Production startup
export IMAGE=quay.io/ascend/vllm-ascend:v0.13.0rc3
export NAME=vllm-ascend-deepseek_v4_flash
export PATCH_DIR="/root/patches"   # directory where the patch repo was cloned above
docker run -d \
--name $NAME \
--net=host \
--shm-size=1024g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-v /data1/models:/data1/models:ro \
-v "$PATCH_DIR/DeepSeek-V4-Flash":/tmp/deepseek-patch:ro \
$IMAGE \
bash -c "
set -e
echo 'Applying patch...'
cd /vllm-workspace/vllm && git apply /tmp/deepseek-patch/deepseek-v4-agentic-support.patch
echo 'Patch applied successfully'
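# Performance-tuning environment variables for vllm-ascend on Ascend NPUs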
export HCCL_OP_EXPANSION_MODE=AIV
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
echo 'Starting the model'
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp \
--host 0.0.0.0 \
--max-model-len 524288 \
--max-num-batched-tokens 8192 \
--served-model-name deepseek-v4-flash \
--gpu-memory-utilization 0.9 \
--max-num-seqs 10 \
--data-parallel-size 1 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--quantization ascend \
--port 8006 \
--block-size 128 \
--chat-template /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
--async-scheduling \
--additional-config '{\"enable_cpu_binding\": \"true\", \"multistream_overlap_shared_expert\": true}' \
--speculative-config '{\"num_speculative_tokens\": 1,\"method\": \"deepseek_mtp\"}' \
--compilation-config '{\"cudagraph_mode\":\"FULL_DECODE_ONLY\"}' \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4
"