Gradient Descent: Introduction and Formula Derivation

Gradient descent is an iterative algorithm for solving unconstrained optimization problems and is widely used in machine learning and artificial intelligence. Its basic idea is to update the variables in the direction opposite to the gradient of the function, gradually approaching a minimum point. Below we derive the gradient descent formulas in detail.

1. Problem Definition

Suppose we have a differentiable convex function $f : \mathbb{R}^n \to \mathbb{R}$. We want to find a point $x^* \in \mathbb{R}^n$ such that $f(x^*)$ is the minimum value of $f$.

2. Definition of the Gradient

The gradient of $f$ at a point $x$ is a vector, denoted $\nabla f(x)$, whose $i$-th component is the partial derivative of $f$ with respect to $x_i$:

$$
\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)^T
$$
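
For $f(x) = x_1^2 + 3x_2^2$, for example, the gradient is $\nabla f(x) = (2x_1,\ 6x_2)^T$. The following small sketch (the test function is an illustrative assumption, not part of the original text) compares this analytic gradient with a central-difference approximation:

import numpy as np

def f(x):
    # Illustrative test function f(x) = x1^2 + 3*x2^2
    return x[0] ** 2 + 3 * x[1] ** 2

def grad_f(x):
    # Analytic gradient: (2*x1, 6*x2)
    return np.array([2 * x[0], 6 * x[1]])

def numerical_gradient(f, x, eps=1e-6):
    # Central-difference approximation of each partial derivative
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x0 = np.array([1.0, 2.0])
print(grad_f(x0))                  # [ 2. 12.]
print(numerical_gradient(f, x0))   # approximately [ 2. 12.]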

3. The Basic Idea of Gradient Descent

The basic idea of gradient descent is: at each iteration, move from the current point $x_k$ one step in the direction opposite to the gradient, in the hope of decreasing the function value. That is:

$$
x_{k+1} = x_k - \alpha_k \nabla f(x_k)
$$

where $\alpha_k > 0$ is the step size (also called the learning rate) at step $k$; it controls how far we move along the negative gradient direction.
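
Written as code, the update rule is just a short loop. Below is a minimal sketch with a fixed step size; the toy objective $f(x) = x_1^2 + 3x_2^2$, the starting point, and the settings are chosen only for illustration:

import numpy as np

def grad_f(x):
    # Gradient of the toy objective f(x) = x1^2 + 3*x2^2
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([5.0, -3.0])  # initial point x_0
alpha = 0.1                # fixed step size (learning rate)
for k in range(100):
    x = x - alpha * grad_f(x)  # x_{k+1} = x_k - alpha * grad f(x_k)
print(x)  # converges toward the minimizer (0, 0)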

4. Choosing the Step Size

The choice of step size $\alpha_k$ has a major impact on the convergence speed and behavior of the algorithm. Common choices include:

  • Fixed step size: use a single fixed step size $\alpha$ throughout the iterations.
  • Line search: at each iteration, determine the optimal step size $\alpha_k$ by a line search, i.e. the value that minimizes $f(x_k - \alpha_k \nabla f(x_k))$.
  • Adaptive step size: adjust the step size dynamically according to some rule (e.g. the Armijo rule); see the sketch after this list.
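
A minimal backtracking line-search sketch under the Armijo (sufficient decrease) condition is shown below. The objective f, its gradient grad_f, and the constants alpha0, rho, and c are illustrative assumptions rather than values from the original text:

import numpy as np

def backtracking_step(f, grad_f, x, alpha0=1.0, rho=0.5, c=1e-4):
    # Shrink alpha until f(x - alpha*g) <= f(x) - c * alpha * ||g||^2
    g = grad_f(x)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
        alpha *= rho  # reduce the step size geometrically
    return alpha

# Example usage on a toy quadratic
f = lambda x: x[0] ** 2 + 3 * x[1] ** 2
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])
x = np.array([5.0, -3.0])
for k in range(50):
    step = backtracking_step(f, grad_f, x)
    x = x - step * grad_f(x)
print(x)  # close to (0, 0)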

5. Algorithm Steps

  1. Initialization: choose an initial point $x_0$ and an initial step size $\alpha_0$.
  2. Iteration: for $k = 0, 1, 2, \ldots$:
  • Compute the gradient $\nabla f(x_k)$.
  • Choose the step size $\alpha_k$.
  • Update the point: $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$.
  3. Termination: stop iterating when some termination condition is met (e.g. the norm of the gradient falls below a threshold, or the maximum number of iterations is reached).
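
Putting these steps together, a generic routine with the termination check from step 3 might look like the following sketch (the fixed step size and the tol and max_iter defaults are assumptions for illustration):

import numpy as np

def gradient_descent_generic(grad_f, x0, alpha=0.1, tol=1e-6, max_iter=10000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # stop once the gradient norm is small enough
            break
        x = x - alpha * g            # move against the gradient
    return x

# Example: minimize f(x) = (x1 - 1)^2 + (x2 + 2)^2, whose minimizer is (1, -2)
grad_f = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])
print(gradient_descent_generic(grad_f, [0.0, 0.0]))  # approximately [ 1. -2.]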

6. Example

Take fitting a straight line $f(x) = wx + b$ as an example to show how gradient descent solves such a problem.

In linear regression, when fitting the parameters $w, b$ to the sample data, we need a loss function that measures the error between the predictions $f(x_i)$ and the true values $y_i$; gradient descent is then used to gradually reduce this error. The loss function is:

$$
\mathcal{L} = \frac{1}{2m} \sum_{i = 0}^{m-1} (f(x_i) - y_i)^2
$$

Compute the gradient by taking partial derivatives of the loss (by the chain rule, $\partial f(x_i)/\partial w = x_i$ and $\partial f(x_i)/\partial b = 1$):

$$
\begin{align*}
\frac{\partial \mathcal{L}}{\partial w} &= \frac{1}{m} \sum_{i = 0}^{m-1} (f(x_i) - y_i)\, x_i \\
\frac{\partial \mathcal{L}}{\partial b} &= \frac{1}{m} \sum_{i = 0}^{m-1} (f(x_i) - y_i)
\end{align*}
$$

The parameters $w, b$ are updated in the direction opposite to the gradient, using a fixed learning rate $\alpha$:

$$
\begin{align*}
w &= w - \alpha \frac{\partial \mathcal{L}}{\partial w} \\
b &= b - \alpha \frac{\partial \mathcal{L}}{\partial b}
\end{align*}
$$

Below is a complete Python implementation:

import numpy as np
import matplotlib.pyplot as plt

def compute_prediction(x, w, b):
    # Predicted values f(x_i) = w * x_i + b for every sample
    m = len(x)
    f_wb = np.zeros(m)
    for i in range(m):
        f_wb[i] = w * x[i] + b
    return f_wb

def compute_loss(x, y, w, b):
    # Mean squared error loss: L = (1/2m) * sum((f(x_i) - y_i)^2)
    m = x.shape[0]
    cost_sum = 0
    f_wb = compute_prediction(x, w, b)
    for i in range(m):
        cost = (f_wb[i] - y[i]) ** 2
        cost_sum += cost
    total = cost_sum / (2 * m)
    return total

def compute_gradient(x, y, w, b):
    # Partial derivatives of the loss with respect to w and b
    m = x.shape[0]
    dl_dw = 0
    dl_db = 0
    for i in range(m):
        f_wb = w * x[i] + b
        dl_dw_i = (f_wb - y[i]) * x[i]
        dl_db_i = f_wb - y[i]
        dl_dw += dl_dw_i
        dl_db += dl_db_i
    dl_dw = dl_dw / m
    dl_db = dl_db / m
    return dl_dw, dl_db

def gradient_descent(x, y, w, b, alpha, num_iter):
    # Repeatedly update w and b along the negative gradient direction
    w_new = w
    b_new = b
    for i in range(num_iter):
        dl_dw, dl_db = compute_gradient(x, y, w_new, b_new)
        w_new = w_new - alpha * dl_dw
        b_new = b_new - alpha * dl_db
        loss = compute_loss(x, y, w_new, b_new)
        if i % 100 == 0:
            print(f"iteration {i}, loss = {loss}, w_new = {w_new}, b_new = {b_new}")
    return w_new, b_new

if __name__ == '__main__':
    # Sample data to fit
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    y = np.array([1.5, 2.6, 3.7, 4.1, 5.1, 6.2, 7.9])
    # Initial parameters, learning rate, and number of iterations
    w = 0
    b = 0
    iterations = 3000
    alpha = 0.02
    w_new, b_new = gradient_descent(x, y, w, b, alpha, iterations)
    print(f"w_new = {w_new}, b_new = {b_new}")
    # Plot the sample points and the fitted line
    plt.scatter(x, y, marker='x', c='r', label='samples')
    _y = w_new * x + b_new
    plt.plot(x, _y, c='b', label='fitted line')
    plt.legend()
    plt.show()

The training output is as follows:

iteration 0, loss = 4.083270897959183, w_new = 0.4348571428571429, b_new = 0.08885714285714286  
iteration 100, loss = 0.045664329045427686, w_new = 1.0284820572555182, b_new = 0.2950160966273745  
iteration 200, loss = 0.04399190943292208, w_new = 1.0170929790163885, b_new = 0.35141414029427887  
iteration 300, loss = 0.04321788689124178, w_new = 1.0093449229474887, b_new = 0.389782056936218  
iteration 400, loss = 0.0428596568809325, w_new = 1.0040738754064713, b_new = 0.41588397345192607  
iteration 500, loss = 0.04269386230341597, w_new = 1.0004879511523623, b_new = 0.4336412588461576  
iteration 600, loss = 0.04261712992158636, w_new = 0.9980484260568011, b_new = 0.44572164322645447  
iteration 700, loss = 0.042581616948186136, w_new = 0.9963888029665887, b_new = 0.4539399988779966  
iteration 800, loss = 0.042565180976179944, w_new = 0.9952597517399471, b_new = 0.4595309939563808  
iteration 900, loss = 0.042557574145494485, w_new = 0.9944916516334235, b_new = 0.46333458037827735  
iteration 1000, loss = 0.04255405358260628, w_new = 0.9939691086701637, b_new = 0.4659221823106715  
iteration 1100, loss = 0.042552424209801445, w_new = 0.9936136196575556, b_new = 0.4676825429329474  
iteration 1200, loss = 0.042551670110010785, w_new = 0.9933717784177523, b_new = 0.4688801264876919  
iteration 1300, loss = 0.04255132110056829, w_new = 0.9932072524178053, b_new = 0.4696948495350295  
iteration 1400, loss = 0.04255115957341695, w_new = 0.9930953244245458, b_new = 0.4702491103556911  
iteration 1500, loss = 0.04255108481606821, w_new = 0.9930191791607023, b_new = 0.4706261772108932  
iteration 1600, loss = 0.0425510502171717, w_new = 0.9929673770954908, b_new = 0.47088269798232785  
iteration 1700, loss = 0.042551034204249084, w_new = 0.9929321358496834, b_new = 0.471057210552896  
iteration 1800, loss = 0.042551026793212446, w_new = 0.9929081610254951, b_new = 0.47117593246814665  
iteration 1900, loss = 0.04255102336326615, w_new = 0.9928918508133643, b_new = 0.47125669967657685  
iteration 2000, loss = 0.04255102177583213, w_new = 0.992880754881329, b_new = 0.4713116460775329  
iteration 2100, loss = 0.04255102104114237, w_new = 0.9928732062543251, b_new = 0.4713490264332082  
iteration 2200, loss = 0.04255102070111606, w_new = 0.9928680708795352, b_new = 0.4713744565062632  
iteration 2300, loss = 0.04255102054374656, w_new = 0.9928645772542416, b_new = 0.47139175673264894  
iteration 2400, loss = 0.0425510204709134, w_new = 0.992862200520665, b_new = 0.4714035261771313  
iteration 2500, loss = 0.04255102043720508, w_new = 0.9928605836150096, b_new = 0.47141153299856486  
iteration 2600, loss = 0.04255102042160434, w_new = 0.992859483624702, b_new = 0.47141698008568605  
iteration 2700, loss = 0.042551020414384015, w_new = 0.9928587352949075, b_new = 0.47142068577068896  
iteration 2800, loss = 0.04255102041104237, w_new = 0.9928582262018023, b_new = 0.47142320676971744  
iteration 2900, loss = 0.0425510204094958, w_new = 0.9928578798628064, b_new = 0.47142492181999723  
w_new = 0.9928576461815167, b_new = 0.4714260789959605  

The script also produces a plot showing the sample points (red crosses) together with the fitted line (blue).
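
For larger datasets, the per-sample loops above are usually replaced by vectorized NumPy operations. A sketch of an equivalent vectorized loss and gradient (the function names are illustrative, not part of the original code) would be:

import numpy as np

def compute_loss_vectorized(x, y, w, b):
    # Same loss as compute_loss, written with array operations
    err = (w * x + b) - y
    return np.mean(err ** 2) / 2

def compute_gradient_vectorized(x, y, w, b):
    # Same gradient as compute_gradient, written with array operations
    err = (w * x + b) - y
    return np.mean(err * x), np.mean(err)

These can be dropped into gradient_descent in place of the loop-based versions without changing the results.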

Summary

Gradient descent is a simple and effective optimization algorithm: by updating the variables in the direction opposite to the gradient, it approaches a minimum of the function step by step. Its success depends on a suitable choice of step size and on the properties of the function. In practice, gradient descent is widely used to train machine learning models such as linear regression, logistic regression, and neural networks.
