Floating-Point Formats: The Differences Between FP16, BF16, and FP32, and LLM GPU Memory Usage

Author: @小白创作中心

Source: CSDN
https://blog.csdn.net/u012856866/article/details/145949939

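In deep learning, and for large language models (LLMs) in particular, the floating-point precision format has a significant effect on training and inference performance. This article describes three common formats (FP16, BF16, and FP32), explains how they differ, and uses a hands-on experiment to show how the choice of precision changes the GPU memory an LLM occupies.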

1. FP16

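FP16 (float16) is the half-precision floating-point format, called binary16 in the IEEE 754 standard. It stores a number in 16 bits (2 bytes), split into three fields:

Sign: 1 bit. 0 means the number is positive; 1 means it is negative.
Exponent: 5 bits, encoding the power of two. Field values 00001 (1) through 11110 (30) are used for normal numbers. A bias of 15 is subtracted so that negative exponents can be represented, which gives an exponent range of -14 to +15.
Fraction: 10 bits, holding the fractional part of the significand. The effective precision is 11 bits because the leading 1 is implicit.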

FP16 can represent values in the range [-65504, 65504]. In PyTorch, torch.finfo(torch.float16) reports the details:

import torch
torch.finfo(torch.float16)
# Result
finfo(resolution=0.001, min=-65504, max=65504, eps=0.000976562, smallest_normal=6.10352e-05, tiny=6.10352e-05, dtype=float16)
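The field layout described above can be checked by hand. The following is a minimal sketch that uses only Python's standard library (the `struct` format code `e` is IEEE 754 binary16). The helper names `decode_fp16` and `rebuild_fp16` are made up for this illustration, and the sketch covers normal numbers only, not subnormals, infinities, or NaN.

```python
import struct

def decode_fp16(value):
    """Split a normal FP16 number into (sign, exponent field, fraction field)."""
    # Pack as big-endian binary16, then reinterpret the two bytes as an unsigned integer.
    (bits,) = struct.unpack(">H", struct.pack(">e", value))
    sign = bits >> 15                # 1 bit
    exponent = (bits >> 10) & 0x1F   # 5 bits, stored with a bias of 15
    fraction = bits & 0x3FF          # 10 bits, the implicit leading 1 is not stored
    return sign, exponent, fraction

def rebuild_fp16(sign, exponent, fraction):
    """Rebuild the value of a normal FP16 number from its three fields."""
    return (-1) ** sign * (1 + fraction / 1024) * 2.0 ** (exponent - 15)

# 65504 is the largest finite FP16 value: exponent field 11110, all fraction bits set.
sign, exponent, fraction = decode_fp16(65504.0)
print(sign, format(exponent, "05b"), format(fraction, "010b"))
print(rebuild_fp16(sign, exponent, fraction))
```

Decoding 65504 gives the largest usable exponent field (30) and a fraction of all ones, which is where the upper bound of the FP16 range comes from: (1 + 1023/1024) * 2**15.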

2. BF16

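BF16 (bfloat16) is a floating-point format that Google designed specifically for machine learning. It also uses 16 bits, but divides them between exponent and fraction differently from FP16:

Sign: 1 bit. 0 means the number is positive; 1 means it is negative.
Exponent: 8 bits, encoding the power of two, with a bias of 127.
Fraction: 7 bits, holding the fractional part of the significand. The effective precision is 8 bits.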

BF16 as reported by PyTorch:

import torch
torch.finfo(torch.bfloat16)
# Result
finfo(resolution=0.01, min=-3.38953e+38, max=3.38953e+38, eps=0.0078125, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=bfloat16)

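BF16 covers the range [-3.38953e+38, 3.38953e+38], as the finfo output shows. Its precision is lower than FP16's, but its range is far wider, which helps prevent overflow during training.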
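Because BF16 shares FP32's sign bit and 8-bit exponent, a BF16 value is simply the upper 16 bits of an FP32 bit pattern. The sketch below illustrates the range and precision trade-off using only the standard library. The helper `to_bf16` is a hypothetical name for this example; it rounds to nearest with ties to even and does not handle NaN, so it should be read as an illustration rather than a reference implementation.

```python
import struct

def to_bf16(value):
    """Round a Python float to the nearest BF16 value (NaN is not handled)."""
    # Reinterpret the FP32 encoding of the value as a 32-bit unsigned integer.
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    # Round to nearest, ties to even, before dropping the low 16 bits.
    bits32 += 0x7FFF + ((bits32 >> 16) & 1)
    bits16 = (bits32 >> 16) & 0xFFFF
    # Pad the 16 surviving bits with zeros to read the result back as FP32.
    (result,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return result

print(to_bf16(70000.0))  # within range, but the spacing between values is coarse
print(to_bf16(1.001))    # only 7 fraction bits survive

try:
    struct.pack(">e", 70000.0)  # the same value does not fit in FP16
except OverflowError as exc:
    print("FP16 overflow:", exc)
```

With this sketch, 70000 becomes 70144 in BF16 and 1.001 collapses to 1.0, while packing 70000 as FP16 raises an overflow error: BF16 gives up precision to keep the range.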

3. FP32

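FP32 (float32) is the single-precision floating-point format, called binary32 in the IEEE 754 standard. It stores a number in 32 bits (4 bytes), split into three fields:

Sign: 1 bit. 0 means the number is positive; 1 means it is negative.
Exponent: 8 bits, encoding the power of two, with a bias of 127.
Fraction: 23 bits, holding the fractional part of the significand. The effective precision is 24 bits.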

FP32 as reported by PyTorch:

import torch
torch.finfo(torch.float32)
# Result
finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=float32)

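FP32 covers the range [-3.40282e+38, 3.40282e+38]. It is more precise than both FP16 and BF16. Its range is far wider than FP16's and essentially the same as BF16's, since both use an 8-bit exponent. The cost is twice the storage per value.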
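The max, smallest_normal, and eps values that torch.finfo reports for the three formats follow directly from the number of exponent and fraction bits. The sketch below derives them in plain Python. The function name `format_limits` is an assumption made for this example, and it relies on the IEEE-style convention that the all-zeros and all-ones exponent fields are reserved.

```python
def format_limits(exponent_bits, fraction_bits):
    """Derive (max, smallest normal, eps) for an IEEE-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias        # the all-ones exponent field is reserved for inf/NaN
    min_exponent = 1 - bias    # the all-zeros field is reserved for zero/subnormals
    eps = 2.0 ** -fraction_bits  # gap between 1.0 and the next larger value
    largest = (2 - eps) * 2.0 ** max_exponent
    smallest_normal = 2.0 ** min_exponent
    return largest, smallest_normal, eps

for name, e_bits, f_bits in [("FP16", 5, 10), ("BF16", 8, 7), ("FP32", 8, 23)]:
    largest, smallest_normal, eps = format_limits(e_bits, f_bits)
    print(f"{name}: max={largest:.6g} smallest_normal={smallest_normal:.6g} eps={eps:.6g}")
```

The printed values should agree with the finfo outputs shown in sections 1 to 3. BF16 and FP32 share the same smallest normal value because they share the same exponent width; only eps and the last digits of max differ.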

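4. GPU Memory Usage at Different Precisions

The following experiment uses an NVIDIA A40 48G GPU and the llama-2-7b-hf model to compare GPU memory usage across precision formats. The model is saved in float16 by default.

First, check the environment: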

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Print library versions
print("transformers version:", transformers.__version__)
print("torch version:", torch.__version__)

# Check whether a GPU is available
if torch.cuda.is_available():
    # Number of available GPU devices
    num_devices = torch.cuda.device_count()
    print("Available GPUs:", num_devices)

    # Print details for every available GPU
    for i in range(num_devices):
        device = torch.cuda.get_device_properties(i)
        print(f"\nDetails for GPU {i}:")
        print("Name:", device.name)
        print("Compute capability:", f"{device.major}.{device.minor}")
        print("Total memory (GB):", round(device.total_memory / (1024**3), 1))
else:
    print("No GPU available")

Output:

transformers version: 4.32.1
torch version: 2.0.1+cu117
Available GPUs: 1

Details for GPU 0:
Name: NVIDIA A40
Compute capability: 8.6
Total memory (GB): 44.4

Next, load the model and check GPU memory usage:

Loading in FP16

# Load the model
model_name = "/path/to/llama-2-7b-hf" # where your model is stored
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", torch_dtype=torch.float16)

total_parameters = model.num_parameters()
print("Total parameters in the model:", total_parameters)

# Size of each parameter in bytes (FP16 uses 2 bytes)
size_per_parameter_bytes = 2

# Theoretical size of the weights in bytes
total_memory_bytes = total_parameters * size_per_parameter_bytes

# Convert bytes to GB (1024**3 bytes)
total_memory_gb = total_memory_bytes / (1024**3)

print("Total memory occupied by the model in GB:", total_memory_gb)

# Measure the GPU memory actually allocated
memory_allocated = torch.cuda.memory_allocated(device='cuda:0')

# Convert bytes to GB (1024**3 bytes)
memory_allocated_gb = memory_allocated / (1024**3)

print("Memory allocated by the model in GB:", memory_allocated_gb)

Output:

Total parameters in the model: 6738415616 # about 6.74B
Total memory occupied by the model in GB: 12.551277160644531
Memory allocated by the model in GB: 12.582542419433594

Loading in BF16

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", torch_dtype=torch.bfloat16)

# Measure the GPU memory actually allocated
memory_allocated = torch.cuda.memory_allocated(device='cuda:0')

# Convert bytes to GB (1024**3 bytes)
memory_allocated_gb = memory_allocated / (1024**3)

print("Memory allocated by the model in GB:", memory_allocated_gb)

Output:

Memory allocated by the model in GB: 12.582542419433594

Loading in FP32

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", torch_dtype=torch.float32)

# Measure the GPU memory actually allocated
memory_allocated = torch.cuda.memory_allocated(device='cuda:0')

# Convert bytes to GB (1024**3 bytes)
memory_allocated_gb = memory_allocated / (1024**3)

print("Memory allocated by the model in GB:", memory_allocated_gb)

Output:

Memory allocated by the model in GB: 25.165069580078125

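The results show that llama-2-7b-hf, which has about 6.74 billion parameters (nominally 7 billion), occupies roughly 12.6 GB of GPU memory when loaded in FP16 and the same amount in BF16, because both formats use 2 bytes per parameter. Loaded in FP32, at 4 bytes per parameter, it occupies roughly 25.2 GB. GB here means 1024**3 bytes, as in the code.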
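The same arithmetic can be wrapped in a small helper to estimate weight memory before loading a model. This is a minimal sketch: the function name `estimate_weight_memory_gib` and the lookup table are assumptions made for this example, and the parameter count is the one printed in the FP16 run.

```python
# Bytes used to store one parameter in each format.
BYTES_PER_PARAMETER = {"float16": 2, "bfloat16": 2, "float32": 4}

def estimate_weight_memory_gib(num_parameters, dtype_name):
    """Estimate the memory needed for the weights alone, in units of 1024**3 bytes."""
    return num_parameters * BYTES_PER_PARAMETER[dtype_name] / (1024 ** 3)

num_parameters = 6738415616  # parameter count reported for llama-2-7b-hf
for dtype_name in BYTES_PER_PARAMETER:
    estimate = estimate_weight_memory_gib(num_parameters, dtype_name)
    print(f"{dtype_name}: {estimate:.2f}")
```

The estimate counts weights only. The measured figures in the experiment are slightly higher than the estimates, and activations, the KV cache, and (during training) gradients and optimizer state all need additional memory on top of this.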

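5. Converting Between Precisions

After a model has been loaded, PyTorch's built-in conversion methods can switch it between precision formats.

5.1 Converting FP16 to FP32

# Load in float16
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", torch_dtype=torch.float16)

# Convert to float32
model.float()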

5.2 Converting FP32 to FP16

# Load in float32
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", torch_dtype=torch.float32)

# Convert to float16
model.half()
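Converting down to FP16 is lossy, and converting back up to FP32 does not restore the lost bits. The effect on individual values can be sketched without a GPU or PyTorch by using the standard library's binary16 support. The helper `round_trip_fp16` is a hypothetical name for this illustration, and Python floats (which are 64-bit) stand in for the original weights, so the sketch approximates what a cast does to each value rather than reproducing model.half() itself.

```python
import struct

def round_trip_fp16(value):
    """Cast a Python float to FP16 and back, mimicking a down-cast followed by an up-cast."""
    (result,) = struct.unpack(">e", struct.pack(">e", value))
    return result

# 0.5 is exactly representable, 0.1 is rounded, 1e-8 is too small for FP16.
for weight in [0.5, 0.1, 1e-8]:
    print(weight, "->", round_trip_fp16(weight))
```

Values that FP16 can represent exactly, such as 0.5, survive unchanged. 0.1 comes back as 0.0999755859375, and 1e-8 underflows to 0.0 because it is below the smallest value FP16 can hold.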
