GPU Memory Usage During LLM Training Explained

Author: @小白创作中心

Source: CSDN

https://m.blog.csdn.net/m0_49448331/article/details/144142207

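Managing GPU memory well is essential when training large language models (LLMs). This article breaks down where GPU memory goes during LLM training: the main components of memory usage, how numeric precision changes the footprint, and how single-GPU and multi-GPU training differ. Concrete calculations and a code example are used to make the process easier to follow.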

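Main components of GPU memory usage

During LLM training, GPU memory is mainly taken up by the following:

Model parameters: the model's own weights. Take their size as 1 unit.
Gradients: the same size as the parameters, so another 1 unit.
Optimizer state: Adam, for example, stores two extra values per parameter, the first-moment (momentum) estimate m and the second-moment (squared-gradient) estimate v, so it takes 2 units.
Activations: backpropagation needs them to compute derivatives, so the input to every layer has to be kept. For a fully connected layer in a Transformer, each layer's input has shape [batch, sequence length, per-token dimension].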
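
The unit accounting above can be sketched in a few lines of Python. This is only an illustration: the helper name `layer_input_mib` and the example shape (batch 8, sequence length 2048, hidden size 4096, fp16) are assumptions chosen for the sketch, not values from the article.

```python
# Static memory in "units" of the parameter size, as listed above.
STATIC_UNITS = {"parameters": 1, "gradients": 1, "adam_states": 2}


def layer_input_mib(batch, seq_len, hidden, bytes_per_elem):
    """Size in MiB of one [batch, seq_len, hidden] activation tensor."""
    return batch * seq_len * hidden * bytes_per_elem / 1024 ** 2


total_units = sum(STATIC_UNITS.values())

# Hypothetical shape, stored in fp16 (2 bytes per element).
example_mib = layer_input_mib(batch=8, seq_len=2048, hidden=4096, bytes_per_elem=2)

print(total_units)   # 4
print(example_mib)   # 128.0
```

Unlike the first three components, the activation term scales with batch size and sequence length rather than with the parameter count, and one such tensor is kept for every layer.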

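How precision affects memory usage

Four precision setups are in common use: fp32, fp16, int8, and mixed precision (fp16/fp32).

Uniform precision (one data type throughout):
fp32: 4 bytes per parameter
fp16: 2 bytes per parameter
int8: 1 byte per parameter
Mixed precision:
Model parameters and activations are stored in fp16
An fp32 copy of the weights and the fp32 optimizer state are also kept, and the weight update at the end of the backward pass is applied to that fp32 copy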

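Taking LLaMA 7B as an example (parameters, gradients, and optimizer state only, activations excluded):

Uniform precision:
fp32: parameters 28 GB, gradients 28 GB, optimizer state 56 GB, 112 GB in total
fp16: 56 GB in total
int8: 28 GB in total
Mixed precision:
Parameters: two copies (fp16 and fp32), 14 GB + 28 GB = 42 GB
Gradients: one fp16 copy, 14 GB
Optimizer state: one fp32 copy, 56 GB
Total: 112 GB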
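
The LLaMA 7B figures can be reproduced with a short calculation. The sketch below assumes exactly 7 billion parameters, Adam as the optimizer, and 1 GB = 1e9 bytes; the function name `static_memory_gb` is made up for this illustration.

```python
BYTES = {"fp32": 4, "fp16": 2, "int8": 1}


def static_memory_gb(num_params, mode):
    """Parameters + gradients + Adam state in GB (1 GB = 1e9 bytes), activations excluded."""
    if mode == "mixed":
        parts = {
            # fp16 working copy plus fp32 master copy of the weights
            "parameters": num_params * (BYTES["fp16"] + BYTES["fp32"]),
            "gradients": num_params * BYTES["fp16"],
            # Adam keeps m and v, both in fp32
            "optimizer": num_params * 2 * BYTES["fp32"],
        }
    else:
        b = BYTES[mode]
        parts = {
            "parameters": num_params * b,
            "gradients": num_params * b,
            "optimizer": num_params * 2 * b,
        }
    parts["total"] = sum(parts.values())
    return {name: size / 1e9 for name, size in parts.items()}


for mode in ("fp32", "fp16", "int8", "mixed"):
    print(mode, static_memory_gb(7_000_000_000, mode))
```

Under these assumptions mixed precision ends up at the same 112 GB as pure fp32: the fp16 savings on the working weights and gradients are offset by the extra fp32 master copy.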

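Calculating the memory used by the model and optimizer

Ideal usage, taking FP32 as an example:

Memory for model parameters:

The model has 88.44M parameters stored in FP32, 4 bytes each
Parameter memory = number of parameters × 4 bytes = 353.76 MB ≈ 0.345 GB (at 1024 MB per GB)

Memory for optimizer state:

The optimizer has to hold the gradient, the first moment, and the second moment, 4 bytes each per parameter (the gradient is counted together with the optimizer state here)
Optimizer state memory = 88.44M × 4 × 3 = 1061.28 MB ≈ 1.036 GB

Total memory:

Total memory = parameter memory + optimizer state memory = 0.345 GB + 1.036 GB ≈ 1.38 GB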
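
The same arithmetic as a small Python sketch. It follows the convention used in the figures above, where megabytes are converted to gigabytes by dividing by 1024; the variable names are illustrative.

```python
params_m = 88.44        # parameter count in millions, from the text
bytes_per_value = 4     # FP32

# MB -> GB at 1024 MB per GB, matching the figures in the text.
param_gb = params_m * bytes_per_value / 1024
# Gradient + first moment + second moment, 4 bytes each.
optimizer_gb = params_m * bytes_per_value * 3 / 1024
total_gb = param_gb + optimizer_gb

print(f"{param_gb:.3f} {optimizer_gb:.3f} {total_gb:.2f}")  # prints 0.345 1.036 1.38
```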

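How activation gradients are stored in single-GPU and multi-GPU training

Single machine, single GPU:

Activation gradients do not need to be stored explicitly; they are computed on the fly during backpropagation
What is stored: activations, model parameters, and parameter gradients

Single machine, multiple GPUs:

Data parallelism: every GPU runs its own forward and backward pass, so activation gradients still do not need to be kept
Model parallelism (tensor parallelism, pipeline parallelism): some intermediate results may have to be cached temporarily or stored explicitly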
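
The single-GPU behaviour can be observed directly in PyTorch's autograd. The sketch below runs on CPU with small made-up tensors and only illustrates the default: gradients of intermediate activations are used during the backward pass and then dropped, unless `retain_grad()` is called on the tensor. It does not model multi-GPU training.

```python
import warnings

import torch

x = torch.randn(4, 8, requires_grad=True)  # leaf tensor, stands in for a parameter
h = torch.relu(x * 2.0)                    # intermediate activation (non-leaf)
h_kept = torch.relu(x * 3.0)               # second activation whose gradient we keep
h_kept.retain_grad()                       # ask autograd to store d(loss)/d(h_kept)

loss = h.sum() + h_kept.sum()
loss.backward()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")        # reading .grad of a non-leaf tensor warns
    transient_grad = h.grad                # never stored, so this is None

print(transient_grad is None)              # activation gradient was not kept
print(h_kept.grad.shape)                   # kept only because of retain_grad()
print(x.grad.shape)                        # leaf (parameter-like) gradient is stored
```

Storing an activation gradient explicitly, as `retain_grad()` does here, costs as much memory as the activation itself, which is why it is avoided by default.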

Code demonstration

The following simple multi-layer model shows how memory usage changes from stage to stage:

import torch
import torch.nn as nn
import torch.optim as optim

# Pick the device. The memory statistics below rely on torch.cuda, so a CUDA GPU is required.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float32  # floating-point type; can be switched to torch.float16

# A simple multi-layer model
class SimpleModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(input_dim if i == 0 else hidden_dim, hidden_dim) for i in range(num_layers)]
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        activations = []
        for layer in self.layers:
            x = self.relu(layer(x))
            activations.append(x)  # keep each layer's activation
            act_mb = x.numel() * x.element_size() / 1024 ** 2  # size of this activation in MB
            print(f"x {act_mb}")  # prints "x 0.125" for each layer
        return x, activations

# Model hyperparameters
input_dim = 1024
hidden_dim = 1024
num_layers = 12
batch_size = 32

# Create the model, inputs, and optimizer
model = SimpleModel(input_dim, hidden_dim, num_layers).to(device, dtype=dtype)
inputs = torch.randn(batch_size, input_dim, device=device, dtype=dtype)
target = torch.randn(batch_size, hidden_dim, device=device, dtype=dtype)
optimizer = optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

# Report allocated and reserved GPU memory at a given stage
def print_memory(stage):
    print(f"{stage}:")
    print(f"  Allocated memory: {torch.cuda.memory_allocated(device) / 1024 ** 2:.2f} MB")
    print(f"  Reserved memory: {torch.cuda.memory_reserved(device) / 1024 ** 2:.2f} MB")
    print("-" * 50)

inputs_bytes = inputs.element_size() * inputs.numel()
target_bytes = target.element_size() * target.numel()
print(f"Inputs bytes occupied by the tensor: {inputs_bytes/1024**2}MB")
print(f"Target_bytes bytes occupied by the tensor: {target_bytes/1024**2}MB")

# Release cached memory and record the initial state
torch.cuda.empty_cache()
print_memory("Initial")

# Forward pass
output, activations = model(inputs)
print_memory("After Forward Pass (activations stored)")

# Compute the loss
loss = loss_fn(output, target)
print_memory("After Loss Computation")

# Backward pass
loss.backward()
print_memory("After Backward Pass (gradients computed)")

# Optimizer update
optimizer.step()
print_memory("After Optimizer Step")

total_activation_size = sum(a.numel() * a.element_size() for a in activations) / 1024 ** 2  # MB
total_param_size = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024 ** 2  # MB
# Size of the gradients
total_grad_size = sum(p.grad.numel() * p.grad.element_size() for p in model.parameters()) / 1024 ** 2  # MB
# Print the summary
print(f"Activation memory: {total_activation_size:.2f} MB")
print(f"Model parameter memory: {total_param_size:.2f} MB")
print(f"Gradient memory: {total_grad_size:.2f} MB")
print(f"Total memory (activations + gradients + parameters): {total_activation_size + total_param_size + total_grad_size:.2f} MB")

Output (the per-layer "x" lines printed during the forward pass are omitted):

Inputs bytes occupied by the tensor: 0.125MB
Target_bytes bytes occupied by the tensor: 0.125MB
Initial:
  Allocated memory: 48.30 MB  # 48.05 + 0.125 + 0.125 = 48.3 (parameters, inputs, target)
  Reserved memory: 62.00 MB
--------------------------------------------------
After Forward Pass (activations stored):
  Allocated memory: 49.80 MB  # 48.3 + 1.5 = 49.8 (activations added)
  Reserved memory: 64.00 MB
--------------------------------------------------
After Loss Computation:
  Allocated memory: 49.92 MB  # MSELoss only produces a scalar value
  Reserved memory: 64.00 MB
--------------------------------------------------
After Backward Pass (gradients computed):
  Allocated memory: 97.97 MB  # about 48 MB of gradients added
  Reserved memory: 104.00 MB
--------------------------------------------------
After Optimizer Step:
  Allocated memory: 194.06 MB  # about 96 MB added for Adam's momentum and squared-gradient tensors
  Reserved memory: 204.00 MB
--------------------------------------------------
Activation memory: 1.50 MB
Model parameter memory: 48.05 MB
Gradient memory: 48.05 MB
Total memory (activations + gradients + parameters): 97.59 MB

This example makes the memory usage at each stage clear, along with how much memory the activations, model parameters, and gradients each take.
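
The reported sizes can also be cross-checked by hand from the model's shape alone, without a GPU. The sketch below assumes the same configuration as the demo (12 layers of 1024 × 1024, batch size 32, float32); the variable names are illustrative.

```python
MB = 1024 ** 2
input_dim = hidden_dim = 1024
num_layers, batch_size = 12, 32
bytes_per_elem = 4  # torch.float32

# Each nn.Linear(1024, 1024) holds a weight matrix plus a bias vector.
params_per_layer = input_dim * hidden_dim + hidden_dim
param_mb = num_layers * params_per_layer * bytes_per_elem / MB

# One [batch_size, hidden_dim] activation is kept per layer.
activation_mb = num_layers * batch_size * hidden_dim * bytes_per_elem / MB

grad_mb = param_mb            # one gradient value per parameter
adam_state_mb = 2 * param_mb  # Adam's first and second moment estimates

print(f"parameters : {param_mb:.2f} MB")       # 48.05
print(f"activations: {activation_mb:.2f} MB")  # 1.50
print(f"gradients  : {grad_mb:.2f} MB")        # 48.05
print(f"adam state : {adam_state_mb:.2f} MB")  # 96.09
```

These values line up with the jumps in allocated memory: roughly 48.05 MB appears after the backward pass (97.97 − 49.92) and roughly 96.09 MB after the first optimizer step (194.06 − 97.97).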