Random Forest Regression and Feature Selection in Practice

Author: @小白创作中心

Source: CSDN, https://blog.csdn.net/Mdaying/article/details/137092247

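Random forest regression is an ensemble learning method that builds many decision trees and combines their predictions to improve accuracy and stability. This article shows how to use a random forest regressor together with recursive feature elimination with cross-validation (RFECV) for feature selection and prediction, covering data preprocessing, model training, feature importance evaluation, and result visualization.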

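Overview

In this project we use a random forest regressor and recursive feature elimination with cross-validation (RFECV) to select features and predict the target variable of a dataset. The process involves data preprocessing, model training, feature importance evaluation, and visualization of the final results.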

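Overall Workflow

The data processing and analysis workflow consists of four main parts:

Data cleaning: handle missing values and non-numeric error markers.
Feature selection: use RFECV to select the most important features.
Model training: fit a random forest regressor to the data.
Result visualization: plot model performance for different numbers of features.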
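The data-cleaning step in the list above can be sketched on a small in-memory table. This is only an illustration: the columns `feature_a` and `feature_b` and all values are made up, and only the `'#DIV/0!'` marker and the `incident_test` target name come from the article's code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; feature_a and feature_b are invented for this sketch.
raw = pd.DataFrame({
    "feature_a": ["1.0", "#DIV/0!", "3.0", "4.0"],
    "feature_b": [10.0, 20.0, None, 40.0],
    "incident_test": [0.5, 1.5, 2.5, 3.5],
})

# Turn the spreadsheet error marker into NaN, then convert every column to float.
cleaned = raw.replace("#DIV/0!", np.nan).astype(float)

# Fill each missing value with the mean of its own column.
cleaned = cleaned.fillna(cleaned.mean())

print(cleaned)
```

Mean imputation keeps every row, but it also shrinks the variance of the imputed columns, so it may be worth checking how many values were filled before trusting the importances.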

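Terminology

Random Forest: an ensemble learning method made up of many decision trees, used for both classification and regression.
Recursive Feature Elimination with Cross-Validation (RFECV): a feature selection method that repeatedly fits the model and removes the least important features, using cross-validation to decide how many features to keep.
KFold: a cross-validation scheme that splits the dataset into K folds; each fold serves once as the validation set while the remaining folds are used for training.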
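To make the KFold definition concrete, here is a minimal sketch that prints the index split of each fold. The 10-sample array `X_demo` is an assumption made for illustration and is not part of the article's dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (assumed for this sketch): 10 samples with a single feature.
X_demo = np.arange(10).reshape(-1, 1)

# KFold(5) without shuffling cuts the rows into 5 contiguous validation blocks.
kfold = KFold(n_splits=5)
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X_demo), start=1):
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")

# Collect the validation indices of every fold for inspection.
folds = [valid_idx.tolist() for _, valid_idx in kfold.split(X_demo)]
```

Because `KFold` does not shuffle by default, rows that are stored in a meaningful order (for example by time) end up grouped in the same fold; `KFold(n_splits=5, shuffle=True, random_state=...)` is the usual alternative when that is not desired.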

Technical Details

The code uses RandomForestRegressor as the base estimator and performs feature selection with RFECV. KFold(5) defines the 5-fold cross-validation procedure.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt

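# Use a font that can render Chinese characters (SimHei must be installed);
# only needed when plot text contains Chinese. Also keep minus signs readable.
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False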

data_path = '特征变量1.csv'
data = pd.read_csv(data_path)

# Replace the '#DIV/0!' error marker with NaN, convert to float, then fill missing values with column means
data_cleaned = data.replace('#DIV/0!', np.nan).astype(float)
data_cleaned = data_cleaned.fillna(data_cleaned.mean())

# Split the data into features (X) and target (y)
X_clean = data_cleaned.drop('incident_test', axis=1)
y_clean = data_cleaned['incident_test']

# Initialize the random forest regressor
rf_regressor = RandomForestRegressor(random_state=42)

# Run recursive feature elimination with cross-validation (RFECV) using 5-fold KFold
rfecv_regressor = RFECV(estimator=rf_regressor, step=1, cv=KFold(5), scoring='neg_mean_squared_error', min_features_to_select=1)
rfecv_regressor.fit(X_clean, y_clean)

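# Get the importances of the features RFECV kept (from the final refit estimator)
# and sort them in descending order
feature_importances_corrected = rfecv_regressor.estimator_.feature_importances_
sorted_idx_corrected = np.argsort(feature_importances_corrected)[::-1]

# Plot the sorted feature importances as a bar chart
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importances_corrected)), feature_importances_corrected[sorted_idx_corrected])
plt.xlabel('Selected feature (sorted by importance)')
plt.ylabel('Feature importance')
plt.title('Feature importances of the selected features')
plt.show()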

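# Plot cumulative importance against the number of top-ranked features (counted from 1)
cumulative_importances_corrected = np.cumsum(feature_importances_corrected[sorted_idx_corrected])
n_selected = len(cumulative_importances_corrected)
plt.figure(figsize=(10, 6))
plt.plot(range(1, n_selected + 1), cumulative_importances_corrected, 'b-')
plt.xlabel('Number of features (sorted by importance)')
plt.ylabel('Cumulative importance')
plt.title('Cumulative feature importance')
# Dashed reference line at 95% cumulative importance
plt.hlines(y=0.95, xmin=1, xmax=n_selected, color='r', linestyles='dashed')
plt.show()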

# Reuse the RFECV object fitted above: refitting with the same random_state
# and the same folds would only repeat the identical computation.
# Mean cross-validated score (negative MSE) for each candidate number of features
cv_scores = rfecv_regressor.cv_results_['mean_test_score']

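# Bar chart of the cross-validation score for each number of features
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(cv_scores) + 1), cv_scores)
plt.xlabel('Number of features')
plt.ylabel('Cross-validation score (negative MSE, closer to 0 is better)')
plt.title('Number of features vs. cross-validated model performance')
plt.xticks(range(1, len(cv_scores) + 1))  # one tick per bar
plt.tight_layout()  # keep labels and title from being clipped
plt.show()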

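# List the feature subset for each feature count that ranking_ can reconstruct.
# ranking_ is 1 for every selected feature, 2 for the last feature eliminated, and so on,
# so subsets smaller than the selected set cannot be recovered from it.
features_per_count = {}
for rank in range(1, rfecv_regressor.ranking_.max() + 1):
    kept = list(X_clean.columns[rfecv_regressor.ranking_ <= rank])
    features_per_count[len(kept)] = kept
print(features_per_count)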
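The two questions the plots answer visually (how many features reach 95% cumulative importance, and which feature count scores best) can also be read off numerically. The sketch below is one possible way to do that; the helper names `features_needed` and `best_feature_count` and the demo numbers are hypothetical, and `best_feature_count` assumes `step=1` and `min_features_to_select=1` as in the code above, so that position i in the score array corresponds to i + 1 features.

```python
import numpy as np

def features_needed(importances, threshold=0.95):
    """Count how many top-ranked features are needed to reach the threshold."""
    ordered = np.sort(np.asarray(importances, dtype=float))[::-1]
    cumulative = np.cumsum(ordered)
    # First position whose cumulative importance is >= threshold, as a 1-based count.
    count = int(np.searchsorted(cumulative, threshold, side="left")) + 1
    # Guard against a threshold above the total (e.g. floating-point rounding).
    return min(count, len(ordered))

def best_feature_count(neg_mse_scores):
    """Return the feature count with the highest negative-MSE score.

    Assumes scores are ordered for 1, 2, 3, ... features (step=1, min 1).
    """
    return int(np.argmax(neg_mse_scores)) + 1

# Made-up numbers for illustration only.
demo_importances = [0.2, 0.4, 0.04, 0.3, 0.06]
demo_neg_mse = np.array([-9.0, -4.0, -1.0, -1.44])

needed = features_needed(demo_importances)
best = best_feature_count(demo_neg_mse)
rmse = np.sqrt(-demo_neg_mse)  # flip the sign, then take the root to get RMSE

print(needed, best, rmse)
```

With the real model, `feature_importances_corrected` and `cv_scores` from the script above would take the place of the demo arrays. Converting the negative MSE to RMSE puts the score back in the units of the target, which tends to be easier to interpret.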