How Slow Is Your C++ Code? Pinpoint the Time-Consuming Functions with perf!

Author: @小白创作中心

Source: CSDN

https://blog.csdn.net/Long_xu/article/details/144569407

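Performance optimization is a perennial topic in C++ development. Even when the overall architecture is sound, a few inefficient functions can drag down the whole program. This article shows how to use perf, the profiling tool developed alongside the Linux kernel, to pinpoint the time-consuming functions in C++ code so that optimization effort goes where it matters.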

1. Background

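Imagine you have built a sports car with a powerful engine and a lightweight body, but its tires are badly worn and hold the speed down. C++ code is similar: however elegant the architecture and efficient the algorithms, a few very slow functions will drag down the whole program, just like those worn tires.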

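Often the bottleneck is not a flaw in the overall architecture but is hidden in a few inconspicuous functions. They may be only a small part of the code, yet they devour CPU time and make the whole program inefficient. Pinpointing these functions and optimizing them specifically is therefore the key to improving the performance of a C++ program.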

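The trouble is knowing where to start when a program runs slowly. You can spend a great deal of time combing through every corner of the code, checking loops, algorithms, and data structures over and over, and still not find the problem. It is like looking for a needle in a haystack: inefficient and thoroughly demoralizing.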

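Traditional debugging methods, such as printing logs or using a simple timer, usually give only coarse performance information and rarely pinpoint the specific slow function. The difficulty grows when the program is large and function calls are deeply nested. Optimizing blindly costs time and effort, often with little gain, and may even introduce new bugs. What is needed is an efficient and precise way to find the time-consuming functions quickly.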

2. A Brief Introduction to perf

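perf is a powerful performance analysis tool built on the Linux kernel's perf_events subsystem. It profiles a program down to the function level, which helps identify and resolve bottlenecks. By sampling events such as CPU cycles, cache misses, and branch mispredictions, it shows which functions account for most of the cost. Its output can also be turned into visual reports such as flame graphs with external scripts, which makes the expensive parts of the code easy to spot.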

For this kind of analysis, perf relies mainly on two commands: perf record and perf report.

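perf record collects performance data while the program runs. It samples the target program and records events that occur during execution, such as CPU cycles and cache misses, optionally together with the call stack at each sample. The data is stored in a file named perf.data, or in a file of your choice given with -o.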

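Key parameters:

-a: profile all processes (system-wide).
-p <PID>: profile the process with the given process ID.
-g: record call stacks. Flame graphs need this data, and it is essential for seeing which call paths lead to an expensive function.
-e <event>: select the hardware or software event to monitor (for example cycles, cache-misses, instructions).

perf report analyzes the data collected by perf record and presents it in a readable form: an interactive terminal interface by default, or a plain-text report.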

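Key parameters include:

-g, --call-graph: two spellings of the same option; it controls how call chains are displayed (for example caller-based or callee-based) and needs data recorded with perf record -g.
--stdio: print the report as plain text to standard output instead of opening the interactive interface.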
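To make the Overhead percentages in a perf report more concrete, here is a toy model of how a sampling profile turns into per-function shares. It is a simplification and not perf's implementation: perf weights each sample by the number of events it represents, while this sketch assumes every sample counts equally. The function name overhead_percent is made up for the illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model (an assumption, not perf's code): each element of `samples` is
// the name of the function that was executing when one sample was taken.
// Returns each function's share of all samples as a percentage.
std::map<std::string, double> overhead_percent(const std::vector<std::string>& samples) {
    std::map<std::string, std::size_t> counts;
    for (const auto& fn : samples) {
        ++counts[fn];  // one more sample landed in this function
    }

    std::map<std::string, double> result;
    if (samples.empty()) {
        return result;  // nothing sampled, nothing to report
    }
    for (const auto& entry : counts) {
        result[entry.first] =
            100.0 * static_cast<double>(entry.second) / static_cast<double>(samples.size());
    }
    return result;
}
```

With four samples of which three landed in one function, that function gets 75% and the other 25%. The practical consequence is that a function which is rarely on the CPU when a sample is taken receives a small share, however long it waits.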

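An earlier article of mine summarizes the full feature set of perf and its usage tips, so I will not repeat them here. The goal of this article is to use perf to track down time-consuming functions quickly.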

3. Analyzing C++ Code with perf

#include <iostream>
#include <vector>
#include <chrono>
#include <thread>
#include <random>

// Function A: Simple computation
long long functionA(int n) {
    long long sum = 0; // the sum of 0..n-1 overflows a 32-bit int for large n
    for (int i = 0; i < n; ++i) {
        sum += i;
    }
    std::cout << "Function A result: " << sum << std::endl;
    return sum;
}

// Function B: Vector manipulation
std::vector<int> functionB(int n) {
    std::vector<int> vec(n);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> distrib(1, 100);
    for (int i = 0; i < n; ++i) {
        vec[i] = distrib(gen);
    }
    std::cout << "Function B vector size: " << vec.size() << std::endl;
    return vec;
}

// Function C: Simulates I/O-bound operation
void functionC(int n) {
    std::this_thread::sleep_for(std::chrono::milliseconds(n));
}

int main() {
    int n = 1000000000; // Adjust this value to change the workload (functionB allocates n ints, about 4 GB here)
    auto start = std::chrono::high_resolution_clock::now();
#if 0
    auto resultA = functionA(n);
    std::vector<int> resultB = functionB(n);
    functionC(500); // Simulate a 500ms I/O operation
#else
    std::thread th(functionA, n);
    std::thread th2(functionB, n);
    std::thread th3(functionC, 500);
    th.join();
    th2.join();
    th3.join();
#endif
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Total execution time: " << duration.count() << " ms" << std::endl;
    return 0;
}

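This example contains three functions:

functionA performs a simple summation.
functionB creates a vector and fills it with random numbers.
functionC simulates an I/O operation by waiting 500 milliseconds.

You can adjust the value of n to change the program's workload.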
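One point worth keeping in mind before profiling: a sleeping thread takes wall-clock time but almost no CPU time, so a profile based on CPU cycles will attribute next to nothing to functionC. This is consistent with functionC not appearing in the cycles profile shown later in this section. The sketch below illustrates the difference; measure_sleep and Timing are hypothetical names introduced here, and it assumes a Linux/POSIX system, where std::clock reports processor time used by the process.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wall_ms;  // elapsed real time
    double cpu_ms;   // processor time charged to the process
};

// Hypothetical helper for illustration: sleeps for `ms` milliseconds and
// reports how much wall-clock time and CPU time passed during the sleep.
Timing measure_sleep(int ms) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    // Same call that functionC uses to simulate I/O.
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_ms = std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    t.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

Calling measure_sleep(500) should report a wall time of at least 500 ms and a CPU time close to zero. The timer in main measures the former; a cycles-based perf profile reflects the latter.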

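After compiling (for example with g++ -g -pthread test.cpp -o test, where the file name test.cpp is assumed here), start the program: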

./test

Then, while it is still running, run perf top -p $(pidof test) in another terminal to sample the process live. The output looks similar to the following:

Samples: 31K of event 'cycles:u', 4000 Hz, Event count (approx.): 27011663936 lost: 0/0 drop: 0/0
Overhead  Shared Object     Symbol
  25.80%  test              [.] std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 42949
  22.43%  test              [.] std::uniform_int_distribution<int>::operator()<std::mersenne_twister_engine<unsigned long, 32ul
  14.59%  test              [.] std::uniform_int_distribution<int>::_S_nd<unsigned long, std::mersenne_twister_engine<unsigned
   9.39%  test              [.] functionB
   8.67%  test              [.] std::uniform_int_distribution<int>::param_type::a
   6.92%  test              [.] std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 42949
   5.74%  test              [.] std::uniform_int_distribution<int>::param_type::b
   3.20%  test              [.] std::uniform_int_distribution<int>::operator()<std::mersenne_twister_engine<unsigned long, 32ul
   3.14%  test              [.] std::vector<int, std::allocator<int> >::operator[]
   0.09%  test              [.] std::__fill_a1<int*, int>
   0.02%  [kernel]          [k] 0xffffffff9d000f30

Notes: