资讯

历史

科技

环境与自然

成长

游戏

财经

文学与艺术

美食

健康

家居

文化

情感

汽车

三农

军事

旅行

运动

教育

生活

星座命理

使用KTransformer部署DeepSeekR1-671B模型的详细教程

创作时间:

作者:

@小白创作中心

使用KTransformer部署DeepSeekR1-671B模型的详细教程

引用

CSDN

https://blog.csdn.net/qq_26123545/article/details/145757197

本文将详细介绍如何使用KTransformer框架部署DeepSeekR1-671B模型的量化版本，并通过OpenwebUI实现多GPU环境下的模型运行。

ktransformers 是针对预算有限但希望运行大规模模型的用户推出的框架，采用显卡 + CPU 混合推理方案。官方使用双 Xeon ® Gold 6454S（共 64 核）、1TB-D5 内存和 RTX 4090 24GB，实现 671B 4-bit 量化版每秒 13.69 个词令生成速度。ktransformers 采用 Intel AMX 指令扩展，prefill 速度提升明显。

环境配置

CPU：Intel 61332（40C）
内存：256G D4
显卡：4090D4
存储：2T SSD + 16T HDD
操作系统：Ubuntu 22.04
CUDA版本：12.6
Python版本：3.11

支持的模型版本

其他版本可能会导致乱码，建议使用DeepSeek-R1-Q2_K_XS版本。

1. 环境准备

1.1 安装CUDA环境变量

编辑 ~/.bashrc 文件，添加以下内容：

# Adding CUDA to PATH
if [ -d "/usr/local/cuda/bin" ]; then
    export PATH=$PATH:/usr/local/cuda/bin
fi
if [ -d "/usr/local/cuda/lib64" ]; then
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
    # Or you can add it to /etc/ld.so.conf and run ldconfig as root:
    # echo "/usr/local/cuda-12.x/lib64" | sudo tee -a /etc/ld.so.conf
    # sudo ldconfig
fi
if [ -d "/usr/local/cuda" ]; then
    export CUDA_PATH=$CUDA_PATH:/usr/local/cuda
fi

保存并退出。

1.2 安装编译链

sudo apt-get update
sudo apt-get install build-essential cmake ninja-build

1.3 创建conda环境

conda create --name ktransformers python=3.11
conda activate ktransformers # you may need to run ‘conda init’ and reopen shell first
conda install -c conda-forge libstdcxx-ng
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX

1.4 安装PyTorch等依赖

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip3 install packaging ninja cpufeature numpy

2. 安装ktransformers

2.1 克隆代码并初始化

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init
git submodule update

2.2 安装

bash install.sh

如果双插槽CPU，且运行内存大于模型大小的两倍，可以运行：

export USE_NUMA=1
bash install.sh # or `make dev_install`

2.3 安装flashatt

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl

3. 下载模型文件

从以下地址下载模型文件：

将模型文件下载到指定目录，并将以下仓库中的文件全部复制到该目录：

https://github.com/ubergarm/r1-ktransformers-guide

4. 测试运行

创建一个prompt预提示文件：

touch p.txt

运行测试命令：

python ./ktransformers/ktransformers/local_chat.py \
    --gguf_path "/data/DeepSeek-R1/DeepSeek-R1-Q2_K_XS/" \
    --model_path "/data/DeepSeek-R1/DeepSeek-R1-Q2_K_XS/" \
    --prompt_file ./p.txt \
    --cpu_infer 38 \
    --max_new_tokens 1024 \
    --force_think true

如果遇到以下错误：

ImportError: /home/user/anaconda3/envs/ktransformers/bin/…/lib/libstdc++.so.6: version `GLIBCXX_3.4.30’ not found

可以尝试以下两种解决方案：

方法1：

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6
conda install -c conda-forge libstdcxx-ng

方法2：

参考以下链接：

5. Web/API运行

运行命令：

ktransformers  \
    --gguf_path "/home/user/r1_gguf/DeepSeek-R1-Q2_K_XS" \
    --model_path "/home/user/r1_gguf/DeepSeek-R1-Q2_K_XS" \
    --cpu_infer 38 \
   --no_flash_attn false \
   --total_context 2048 \
   --cache_q4 true

如果需要API：

--port 10002

如果需要直接web运行：

--port 10002
--web True

运行成功后，访问 http://localhost:10002/web/index.html#/chat

6. OpenwebUI连接

6.1 本地curl测试

curl -X 'POST' \
  'http://localhost:10002/api/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "DeepSeek-R1-Q2_K_XS", 
  "prompt": "hello.",
  "stream": true
}'

6.2 OpenwebUI配置

在OpenwebUI中添加OpenAI API：

http://host.docker.internal:10002/v1
或
http://127.0.0.1:10002/v1

7. 多GPU运行

参考以下文件：

可能需要修改yml文件，具体配置如下：

- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
# GPU 0: layers 0-5
- match:
    name: "^model\\.layers\\.([0-5])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"
  recursive: False
# GPU 1: layers 6-11
- match:
    name: "^model\\.layers\\.([6-9]|1[0-1])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:1"
      generate_op: "KExpertsMarlin"
  recursive: False
# GPU 2: layers 12-17
- match:
    name: "^model\\.layers\\.(1[2-7])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:2"
      generate_op: "KExpertsMarlin"
  recursive: False
# GPU 3: layers 18-23
- match:
    name: "^model\\.layers\\.(1[8-9]|2[0-3])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:3"
      generate_op: "KExpertsMarlin"
  recursive: False

或者使用：

ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml

运行命令：

ktransformers  \
    --gguf_path "/home/user/r1_gguf/DeepSeek-R1-Q2_K_XS" \
    --model_path "/home/user/r1_gguf/DeepSeek-R1-Q2_K_XS" \
    --cpu_infer 38 \
    --port 10002 \
   --no_flash_attn false \
   --total_context 2048 \
   --cache_q4 true \
    --optimize_config_path  /home/user/r1_gguf/DeepSeek-R1-Q2_K_XS/custom-multi-gpu-4.yaml

注：此功能现在还不完善，官方正在努力优化多GPU实现，请耐心等待后续版本更新。

热门推荐

梨状肌综合征的鉴别诊断与治疗方法