资讯

历史

科技

环境与自然

成长

游戏

财经

文学与艺术

美食

健康

家居

文化

情感

汽车

三农

军事

旅行

运动

教育

生活

星座命理

在树莓派上运行语音识别和LLama-2 GPT！

创作时间:

作者:

@小白创作中心

在树莓派上运行语音识别和LLama-2 GPT！

引用

CSDN

https://blog.csdn.net/AI_SHELL/article/details/137139104

随着人工智能技术的不断发展，越来越多的开发者开始尝试将大语言模型部署到边缘设备上。本文将详细介绍如何在树莓派上运行LLaMA-2 GPT模型和自动语音识别（ASR），实现完全离线的语音交互功能。

硬件要求

首先，我们需要准备一台Raspberry Pi 4。这款单板计算机运行Linux系统，体积小巧，仅需5V直流电源，无需风扇和主动冷却。

对于内存大小，我们有两种选择：

8GB RAM的Raspberry Pi可以运行7B LLaMA-2 GPT模型，该模型在4位量化模式下的内存占用约为5GB。
2GB或4GB RAM的设备可以运行TinyLlama-1B这样的较小模型。这个模型虽然更快，但其答案可能不如7B模型准确。

这两种模型都可以从HuggingFace下载，一般来说，几乎不需要更改代码。

显示设置

我将从显示器开始，因为在测试过程中最好在屏幕上看到一些东西。Adafruit_CircuitPython_SSD1306库允许我们在OLED显示屏上显示任何图像。该库有一个低级接口；它只能从内存缓冲区中绘制像素或单色位图。为了使用可滚动文本，我创建了一个存储文本缓冲区的数组和一个_display_update绘制文本的方法：

oled_width = 128
oled_height = 64
lines, columns = 5, 22
text_buffer = [''] * lines
current_line = 0

def display_update():
    global current_line
    oled.fill(0)
    for i, line in enumerate(text_buffer):
        oled.text(line, 0, i * 10)
    oled.show()
    current_line = (current_line + 1) % lines

这里，一个(22, 5)变量包含我们可以显示的行数和列数。如果发生这种情况，变量oled也可以是 None ImportError；例如，如果我们在笔记本电脑而不是 Raspberry Pi 上运行此代码。为了模拟文本滚动，我还创建了两个辅助方法：

def add_display_line(text):
    global current_line
    lines = text.split('\n')
    for line in lines:
        words = line.split(' ')
        current_line_text = ''
        for word in words:
            if len(current_line_text) + len(word) + 1 <= columns:
                current_line_text += word + ' '
            else:
                text_buffer[current_line] = current_line_text.strip()
                current_line = (current_line + 1) % lines
                current_line_text = word + ' '
        if current_line_text:
            text_buffer[current_line] = current_line_text.strip()
            current_line = (current_line + 1) % lines

def add_display_text(text):
    global current_line
    text_buffer[current_line] += text
    oled.text(text_buffer[current_line], 0, current_line * 10)
    oled.show()

第一种方法是在显示中添加新行；如果字符串太长，该方法会自动将其分成几行。第二种方法是添加不带“回车符”的文本标记；我将使用它来显示 GPT 模型的答案。调用add_display_line方法：

while True:
    add_display_line(f"Current time: {datetime.now()}")
    time.sleep(1)

如果一切运行正常，树莓派会重复打印当前时间：

自动语音识别（ASR）

对于 ASR，我将使用HuggingFace的Transformers库，调用它，通过几行 Python 代码实现语音识别：

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

def transcribe_audio(audio):
    input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]

这里，我使用了Whisper-tiny-en模型，该模型经过 68 万小时的语音数据训练。这是最小的 Whisper 型号；它的文件大小为 151MB。当模型加载后，我们可以使用该ffmpeg_microphone_live方法从麦克风获取数据：

import subprocess

def record_audio(duration=5):
    command = [
        'ffmpeg', '-y', '-f', 'alsa', '-i', 'default', '-t', str(duration),
        '-acodec', 'pcm_s16le', '-ar', '16000', '-ac', '1', '-f', 's16le', '-'
    ]
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    audio_data = process.communicate()[0]
    return audio_data

Raspberry Pi 没有麦克风，但任何 USB 麦克风都可以完成这项工作。该代码也可以在笔记本电脑上测试；在笔记本上，将使用内置麦克风。

大语言模型

现在，让我们添加大语言模型。首先，我们需要安装所需的库：

pip install transformers accelerate

在使用LLM之前，我们需要下载它。正如之前所讨论的，我们有两种选择。对于 8GB Raspberry Pi，我们可以使用 7B 型号。对于 2GB 设备，1B的tiny LLM是唯一可行的选择；较大的模型无法载入RAM。要下载模型，我们可以使用该huggingface-cli工具：

huggingface-cli login
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF

我使用了 Llama-2–7b-Chat-GGUF和 TinyLlama-1–1B-Chat-v1-0-GGUF模型。较小的模型运行速度更快，但较大的模型可能会提供更好的结果。

下载模型后，我们可以使用它：

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GGUF")
model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-Chat-GGUF")

def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

使用该模型很简单，但接下来：我们需要在 OLED 屏幕上流式地显示答案。为此，我将使用自定义回调，每当 LLM 生成新令牌时都会执行该回调：

def stream_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    streamer = TextStreamer(tokenizer)
    outputs = model.generate(**inputs, streamer=streamer)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

测试

最后，组合所有部分。代码很简单：

while True:
    audio_data = record_audio()
    transcription = transcribe_audio(audio_data)
    add_display_line(f"User: {transcription}")
    response = stream_response(transcription)
    add_display_text(f"GPT: {response}\n")

这里，Raspberry Pi 在 5 秒内录制音频，然后语音识别模型将音频转换为文本；最后，将识别出的文本发送给LLM。结束后，重复该过程。这种方法可以改进，例如，通过使用自动音频电平阈值，但对于一个weekend demo来说，它已经足够好了。

在树莓派上运行成功后，输出如下所示：

在这里，我们可以看到 Raspberry Pi 4 上的 1B LLM 推理速度。Raspberry Pi 5 应该快 30-40%。

我没有使用任何基准（例如 BLEU 或 ROUGE）来比较 1B 和 7B 型号的质量。主观上，7B 模型提供了更正确、信息更丰富的答案，但它也需要更多 RAM、更多加载时间（文件大小分别为 4.6 和 0.7GB），并且运行速度慢 3-5 倍。至于功耗，Raspberry Pi 4 在运行模型、连接 OLED 屏幕和 USB 麦克风的情况下平均需要 3-5W。