LLM in Practice: Inference Testing of Qwen2.5-7B on 3090 GPUs and an Analysis of Its Hallucination Problem

Author: @小白创作中心

Source: CSDN

https://m.blog.csdn.net/weixin_65514978/article/details/142466424

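This article shows how to run inference tests with the Qwen2.5-7B large language model on two NVIDIA 3090 GPUs. It covers the GPU configuration, the model download, the inference test code, and the problems encountered along with their fixes, and then looks at a hallucination problem observed in Qwen2.5.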

1. Background

Two new NVIDIA 3090 GPUs were recently installed, and nvidia-smi was used to check the deployment. This test runs inference with the Qwen2.5-7B model.

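The nvidia-smi output (a screenshot in the original post) shows the basic configuration of the 3090 GPUs. Each card has 24 GB of memory and a memory bandwidth of 936.2 GB/s, so the two cards together provide 48 GB. That is enough for common LLM inference services and for model fine-tuning tasks. With QLoRA, 48 GB of GPU memory is even reported to be enough to fine-tune a 65-billion-parameter model【1】. An earlier article, 《大模型显存资源计算以及GPU如何选择》 ("Calculating LLM GPU Memory Requirements and Choosing a GPU"), covered how to estimate the GPU resources a model needs.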

2. LLM GPU Inference Test

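This test uses Qwen2.5-7B in BF16 precision (the code loads the Qwen2.5-7B-Instruct checkpoint). The weights take a little over 15 GB, so a single 24 GB 3090 is enough.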
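
The 15 GB figure follows from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes roughly 7.6 billion parameters for Qwen2.5-7B (an approximate figure that this article does not state) and counts weights only; activations, the KV cache, and CUDA overhead come on top. `weight_memory_gb` is a hypothetical helper name, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: Qwen2.5-7B has about 7.6e9 parameters.
bf16_gb = weight_memory_gb(7.6e9, 16)   # BF16 stores 16 bits per parameter
# 4-bit weights, as used by QLoRA, for a 65-billion-parameter model.
qlora_gb = weight_memory_gb(65e9, 4)

print(f"Qwen2.5-7B in BF16: about {bf16_gb:.1f} GB of weights")
print(f"65B model in 4-bit: about {qlora_gb:.1f} GB of weights")
```

Under these assumptions the BF16 weights fit on one 24 GB card with room left for the KV cache, and the 4-bit 65B weights fit within 48 GB, which is consistent with the QLoRA claim in the background section.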

Model download

Downloading the model from Hugging Face may fail with a connection problem. The error message looks like this:

OSError: We couldn't connect to 'https://huggingface.co' to load this file

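To validate the model quickly, the download was switched to the ModelScope platform. ModelScope can be thought of as a China-based counterpart to Hugging Face; it is hosted in mainland China, so downloads there are faster. Based on the earlier estimate, the 7B model is about 15 GB, so the download may take a while.

Code example: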

# Only snapshot_download is needed to fetch the weights
from modelscope import snapshot_download

# Download the Instruct checkpoint into cache_dir and return the local directory
model_dir = snapshot_download('qwen/Qwen2.5-7B-Instruct', cache_dir='/root/autodl-tmp', revision='master')

The download takes about 20 minutes.
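
A quick sanity check after a long download is to compare the on-disk size of the model directory with the expected size of about 15 GB. The sketch below is one way to do that with the standard library; `dir_size_gb` is a hypothetical helper, and the path in the usage note is the one used in this article.

```python
import os


def dir_size_gb(path: str) -> float:
    """Sum the sizes of all regular files under path, in GB (1 GB = 1e9 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so that linked files are not counted twice.
            if not os.path.islink(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / 1e9
```

For example, `dir_size_gb('/root/autodl-tmp/qwen/Qwen2___5-7B-Instruct')` should come out at roughly 15 if the download completed; a much smaller value suggests an interrupted download.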

Inference test code

The complete inference test code follows:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Local model path (ModelScope writes the "." in "Qwen2.5" as "___" in the directory name)
mode_name_or_path = '/root/autodl-tmp/qwen/Qwen2___5-7B-Instruct'

def get_model():
    # Load the tokenizer from the pretrained checkpoint
    tokenizer = AutoTokenizer.from_pretrained(mode_name_or_path, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    # Load the model weights in BF16 and move them to the selected device
    model = AutoModelForCausalLM.from_pretrained(mode_name_or_path, torch_dtype=torch.bfloat16).to(device)
    return tokenizer, model

# Load the Qwen2.5 model and tokenizer
tokenizer, model = get_model()

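# User prompt (Chinese): "I bought two 3090 GPUs; describe the 3090 GPU's performance data"
prompt = "我购置了两块3090的gpu，描述一下3090gpu的性能数据"
messages = [
    # System message (Chinese): "You are an intelligent LLM user-assistant tool"
    {"role": "system", "content": "你是一款智能的大模型用户助手工具"},
    {"role": "user", "content": prompt}
]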

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Running the test code may fail with an error saying that the tokenizer class does not exist:

ValueError: Tokenizer class Qwen2Tokenizer does not exist or is not currently imported

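This error most likely occurs because the model is newer than the installed transformers release; Qwen2 support, including Qwen2Tokenizer, was added in transformers 4.37.0. Upgrading the local transformers library fixes it: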

pip install --upgrade transformers
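
Whether an upgrade is needed can be checked up front by comparing the installed version against the first release with Qwen2 support. The sketch below assumes that threshold is 4.37.0; `parse_version` and `supports_qwen2` are hypothetical helpers written for illustration, and the parser handles only simple version strings.

```python
# Assumption: Qwen2 tokenizer and model classes first shipped in transformers 4.37.0.
MIN_QWEN2_VERSION = (4, 37, 0)


def parse_version(version: str) -> tuple:
    """Turn a string such as '4.45.0.dev0' into a comparable tuple (4, 45, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_qwen2(version: str) -> bool:
    """Return True if this transformers version should know Qwen2Tokenizer."""
    return parse_version(version) >= MIN_QWEN2_VERSION


# Usage (requires transformers to be installed):
#   import transformers
#   print(supports_qwen2(transformers.__version__))
```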

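After the upgrade, the test code runs without errors. The sample prompt asks the model to describe the 3090 GPU's specifications. The test exposed a hallucination problem in Qwen2.5, however: across repeated runs, the model sometimes wrongly claimed that the 3090 is a virtual card that was never actually released.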
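
Because the hallucination shows up only in some runs, it helps to repeat the same prompt and count how often a suspicious claim appears. The sketch below is a crude keyword check, not a real hallucination detector. `generate_fn` stands for any function that maps a prompt to a response string (for example a wrapper around the inference code above), and the marker strings are hypothetical examples chosen for this prompt.

```python
from typing import Callable, Sequence


def count_flagged_responses(
    generate_fn: Callable[[str], str],
    prompt: str,
    runs: int,
    markers: Sequence[str],
) -> int:
    """Call generate_fn `runs` times and count responses containing any marker."""
    flagged = 0
    for _ in range(runs):
        response = generate_fn(prompt)
        if any(marker in response for marker in markers):
            flagged += 1
    return flagged


# Hypothetical markers for the claim that the card is virtual or unreleased.
HALLUCINATION_MARKERS = ["虚拟", "未发布", "not released"]
```

Repeated runs differ only if sampling is enabled during generation; with greedy decoding every run returns the same text and the count is either 0 or `runs`.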

GPU monitoring

During the test, a GPU monitoring tool such as nvidia-smi can be used to watch GPU usage.
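
For monitoring from a script, nvidia-smi can emit machine-readable CSV through its `--query-gpu` option. The sketch below parses that output into dictionaries; `parse_gpu_stats` and `read_gpu_stats` are hypothetical helper names, and the chosen fields are just one reasonable selection.

```python
import subprocess

# nvidia-smi query: one CSV line per GPU, no header, values without units.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Parse the CSV output of QUERY_CMD into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total, util = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
            "utilization_pct": int(util),
        })
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed stats (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` in a loop while the model generates gives a simple view of memory use and utilization on each of the two cards.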

3. References

【1】Fine-tuning a 65-billion-parameter model on a 48 GB GPU: the QLoRA method from the University of Washington
【2】Qwen2_5-7B-Instruct WebDemo deployment
【3】ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported
