Prometheus Monitoring Metrics Explained: From Concepts to Practice

Author: @小白创作中心

Reference

CSDN

https://m.blog.csdn.net/LinkSLA/article/details/140118162

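Now that Kubernetes has become the standard for container orchestration, deploying microservices keeps getting easier. As the number of microservices grows, however, service governance becomes harder. Observability emerged to help teams locate and resolve problems quickly, and even to notice anomalies before a failure occurs. This article focuses on the metrics pillar of observability and on how to monitor business metrics with Prometheus.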

Observability Overview

Observability rests on three pillars: logging, metrics, and tracing. Each shows a different aspect of the system:

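Logging: records the events an application produces and the logs written during program execution. Logs can explain the system's running state in detail, but storing and querying them consumes a lot of resources, so filters are usually applied to reduce the data volume.
Metrics: aggregated numeric values that need very little storage. They show the state and trend of the system but lack the detail needed to pinpoint a problem. Multidimensional data structures (the source gives "contour-line metrics" as an example) can recover some of that detail.
Tracing: request-oriented, which makes it easy to find the abnormal step in a request. It consumes considerable resources, so sampling is usually used to reduce the data volume.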

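This article focuses on metrics. In Kubernetes environments in particular, Prometheus has become the de facto standard for monitoring cloud-native services.

Prometheus Metric Types in Detail

In Prometheus, every metric is expressed in the following format: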

<metric name>{<label name>=<label value>, ...}

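The metric name describes what the sample measures; for example, http_requests_total is the total number of HTTP requests the system has received. Labels describe the dimensions of a sample, and Prometheus uses these dimensions to filter and aggregate sample data.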
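To make filtering and aggregating by label concrete, here is a minimal sketch in plain Go. It does not use the Prometheus client or query engine; the Sample type and the filter and sumBy helpers are hypothetical names introduced only for illustration, and the sample values are made up.

```go
package main

import "fmt"

// Sample is one labeled measurement, for example
// http_requests_total{method="GET",status="200"} 12.
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// filter keeps the samples whose labels contain every key/value pair in match.
func filter(samples []Sample, match map[string]string) []Sample {
	var out []Sample
	for _, s := range samples {
		ok := true
		for k, v := range match {
			if s.Labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, s)
		}
	}
	return out
}

// sumBy adds up sample values grouped by the value of a single label.
func sumBy(samples []Sample, label string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		out[s.Labels[label]] += s.Value
	}
	return out
}

func main() {
	samples := []Sample{
		{Name: "http_requests_total", Labels: map[string]string{"method": "GET", "status": "200"}, Value: 12},
		{Name: "http_requests_total", Labels: map[string]string{"method": "GET", "status": "500"}, Value: 3},
		{Name: "http_requests_total", Labels: map[string]string{"method": "POST", "status": "200"}, Value: 5},
	}
	// Filtering: only the series with method="GET".
	fmt.Println(len(filter(samples, map[string]string{"method": "GET"})))
	// Aggregating: total requests per status code.
	fmt.Println(sumBy(samples, "status"))
}
```

In Prometheus itself the same two operations are written as a label matcher and an aggregation in a query; the sketch only shows why each label acts as a dimension.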

Prometheus defines four metric types:

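Counter: works like a counter that only increases and never decreases (unless it is reset, for example when the process restarts). Common metrics such as http_requests_total and node_cpu are Counters. By convention, the name of a Counter ends with the _total suffix.
Gauge: reflects the current state of the system, so its sample values can go up or down. Common metrics such as node_memory_MemFree (the host's currently free memory) and node_memory_MemAvailable (available memory) are Gauges.
Summary: used to analyze the distribution of samples. For example, if most HTTP requests complete within 100ms but a few take 5s, the average response time does not reflect what is really happening. With a Summary you can read a quantile of the response time directly, such as the 0.9 quantile.
Histogram: also used to analyze the distribution of samples. It is similar to a Summary but directly records how many samples fall into each bucket, with the bucket upper bounds defined by the le label. Quantiles are computed at query time with the histogram_quantile() function.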

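Application Metrics Monitoring in Practice

Exposing Metrics

Prometheus most commonly collects metrics by pulling them. The service first exposes its metrics on a /metrics endpoint, and the Prometheus server then scrapes the business metrics over HTTP.

Sample code: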

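// Set up the gin engine; middlewares is the project's own package.
server := gin.New()
server.Use(middlewares.AccessLogger(), middlewares.Metric(), gin.Recovery())
// Health-check endpoint.
server.GET("/health", func(c *gin.Context) {
    c.JSON(http.StatusOK, gin.H{
        "message": "ok",
    })
})
// Endpoint that Prometheus scrapes.
server.GET("/metrics", Monitor)

// Monitor serves the metrics of the default Prometheus registry via promhttp.
func Monitor(c *gin.Context) {
    h := promhttp.Handler()
    h.ServeHTTP(c.Writer, c.Request)
}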
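To show what a scrape of /metrics actually returns, here is a standard-library-only sketch that writes the Prometheus text format by hand. It is not how promhttp works internally; the metric name demo_requests_total and the helpers newMux and get are made up for this example, and requests are issued in memory through httptest so the program exits on its own.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// newMux builds a handler with a /health endpoint and a /metrics endpoint
// that reports how many times /health was called.
func newMux() *http.ServeMux {
	var requests int64
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt64(&requests, 1)
		fmt.Fprintln(w, "ok")
	})
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// Text exposition format: HELP and TYPE comments, then one sample per line.
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprintln(w, "# HELP demo_requests_total Total number of handled requests.")
		fmt.Fprintln(w, "# TYPE demo_requests_total counter")
		fmt.Fprintf(w, "demo_requests_total %d\n", atomic.LoadInt64(&requests))
	})
	return mux
}

// get performs an in-memory GET against h and returns the response body.
func get(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Body.String()
}

func main() {
	mux := newMux()
	get(mux, "/health")
	get(mux, "/health")
	fmt.Print(get(mux, "/metrics"))
}
```

In a real service the client library generates this output for every registered metric, which is what the promhttp handler in the sample code above does.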

Defining Metrics

To keep things easy to follow, the example defines metrics of three types that cover two business scenarios:

var (
    // HTTPReqDuration metric: http_request_duration_seconds
    HTTPReqDuration *prometheus.HistogramVec
    // HTTPReqTotal metric: http_requests_total
    HTTPReqTotal *prometheus.CounterVec
    // TaskRunning metric: task_running
    TaskRunning *prometheus.GaugeVec
)

func init() {
    // Request latency of each endpoint.
    // Metric type: Histogram.
    HTTPReqDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "http request latencies in seconds",
        Buckets: nil, // nil falls back to the client's default buckets (prometheus.DefBuckets)
    }, []string{"method", "path"})

    // Number of requests handled by each endpoint.
    // Metric type: Counter.
    HTTPReqTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "total number of http requests",
    }, []string{"method", "path", "status"})

    // Number of tasks currently running.
    // Metric type: Gauge.
    TaskRunning = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "task_running",
        Help: "current count of running tasks",
    }, []string{"type", "state"})

    // Register all three collectors with the default registry.
    prometheus.MustRegister(
        HTTPReqDuration,
        HTTPReqTotal,
        TaskRunning,
    )
}
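Since the histogram above leaves Buckets as nil, the Go client falls back to its default bucket boundaries. The sketch below re-implements the bucketing idea in plain Go to show how observations end up in cumulative le buckets. The histogram type and its methods are hypothetical, the bounds are copied from prometheus.DefBuckets, and the observed latencies are invented.

```go
package main

import (
	"fmt"
	"sort"
)

// defBuckets mirrors the default upper bounds of the Go client
// (prometheus.DefBuckets), in seconds.
var defBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// histogram counts observations per bucket; the extra last slot is +Inf.
type histogram struct {
	bounds []float64
	counts []uint64
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe puts v into the first bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	h.counts[sort.SearchFloat64s(h.bounds, v)]++
}

// cumulative returns the counts the way they are exposed: every le bucket
// also includes all observations from the smaller buckets.
func (h *histogram) cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	h := newHistogram(defBuckets)
	for _, seconds := range []float64{0.003, 0.08, 0.08, 4.2, 12} {
		h.observe(seconds)
	}
	cum := h.cumulative()
	for i, b := range h.bounds {
		fmt.Printf("http_request_duration_seconds_bucket{le=\"%g\"} %d\n", b, cum[i])
	}
	fmt.Printf("http_request_duration_seconds_bucket{le=\"+Inf\"} %d\n", cum[len(cum)-1])
}
```

If most of a service's latencies fall outside the 5ms to 10s range, custom Buckets are usually worth setting, because every observation above the largest bound lands in +Inf.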

Generating Metrics

In a real application, the metrics are updated at the appropriate points in the code:

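// Body of the gin metrics middleware: time the request around c.Next().
start := time.Now()
c.Next()
duration := time.Since(start).Seconds()
// The raw URL path becomes a label value; for routes with parameters,
// the route template from c.FullPath() keeps the number of series lower.
path := c.Request.URL.Path

// Increment the request counter.
controllers.HTTPReqTotal.With(prometheus.Labels{
    "method": c.Request.Method,
    "path":   path,
    "status": strconv.Itoa(c.Writer.Status()),
}).Inc()

// Record how long this request took.
controllers.HTTPReqDuration.With(prometheus.Labels{
    "method": c.Request.Method,
    "path":   path,
}).Observe(duration)

// Simulate a newly created task.
// shuffle is a helper not shown in the article; it presumably picks one element at random.
controllers.TaskRunning.With(prometheus.Labels{
    "type":  shuffle([]string{"video", "audio"}),
    "state": shuffle([]string{"process", "queue"}),
}).Inc()

// Simulate a finished task.
controllers.TaskRunning.With(prometheus.Labels{
    "type":  shuffle([]string{"video", "audio"}),
    "state": shuffle([]string{"process", "queue"}),
}).Dec()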
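The same measure-around-the-handler pattern can be sketched with only the standard library, which makes the moving parts easier to see. This is an illustration and not the article's gin middleware: the stats type, statusRecorder, demoHandler, and hit are hypothetical names, and plain maps stand in for the Counter and Histogram.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// stats is a stand-in for the Counter and Histogram: request counts keyed by
// "method path status" and total seconds keyed by "method path".
type stats struct {
	mu       sync.Mutex
	requests map[string]int
	seconds  map[string]float64
}

func newStats() *stats {
	return &stats{requests: map[string]int{}, seconds: map[string]float64{}}
}

// middleware times the wrapped handler and records the outcome afterwards.
func (s *stats) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		elapsed := time.Since(start).Seconds()

		s.mu.Lock()
		defer s.mu.Unlock()
		s.requests[fmt.Sprintf("%s %s %d", r.Method, r.URL.Path, rec.status)]++
		s.seconds[r.Method+" "+r.URL.Path] += elapsed
	})
}

// demoHandler answers /health with 200; every other path gets a 404.
func demoHandler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	return mux
}

// hit sends an in-memory GET request to h.
func hit(h http.Handler, path string) {
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
}

func main() {
	s := newStats()
	h := s.middleware(demoHandler())
	hit(h, "/health")
	hit(h, "/health")
	hit(h, "/missing")
	fmt.Println(s.requests)
}
```

The status code is read only after the wrapped handler returns, for the same reason the gin version calls c.Next() before touching the metrics.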

Scraping Metrics

Prometheus uses a configuration file to define which targets it scrapes:

global:
  # Scrape interval
  scrape_interval: 5s
# Targets
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']
  - job_name: 'local-service'
    metrics_path: /metrics
    static_configs:
      - targets: ['host.docker.internal:8000']
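Conceptually, each scrape is an HTTP GET against the target's metrics path followed by parsing the text response. The sketch below imitates a single scrape with the standard library. It is a heavily simplified stand-in for the Prometheus scraper: parseText ignores optional timestamps and label escaping, the names parseText and scrape are hypothetical, and the target is a fake in-process server with made-up values.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
)

// parseText reads "series value" lines and skips comments and blank lines.
// It assumes no trailing timestamp, which the real format allows.
func parseText(text string) (map[string]float64, error) {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		i := strings.LastIndexByte(line, ' ')
		if i < 0 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			return nil, err
		}
		out[strings.TrimSpace(line[:i])] = v
	}
	return out, sc.Err()
}

// scrape fetches url once and parses the response body.
func scrape(client *http.Client, url string) (map[string]float64, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseText(string(body))
}

func main() {
	// Fake target standing in for host.docker.internal:8000.
	target := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "# TYPE http_requests_total counter")
		fmt.Fprintln(w, `http_requests_total{method="GET",path="/health",status="200"} 7`)
		fmt.Fprintln(w, `task_running{type="video",state="queue"} 2`)
	}))
	defer target.Close()

	samples, err := scrape(target.Client(), target.URL+"/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	fmt.Println(samples)
}
```

Prometheus repeats this request for every target once per scrape_interval and stores each parsed value with the scrape timestamp.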

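In practice, statically configured target addresses are rarely suitable. In a Kubernetes environment, Prometheus integrates with the Kubernetes API and supports five service discovery modes: Node, Service, Pod, Endpoints, and Ingress.

The resulting metrics look like this: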