Python in Practice: Mastering the Random Forest Workflow from Scratch

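1. Introduction

In machine learning, Random Forest (RF) is a powerful and widely used ensemble learning algorithm. It combines many decision trees to improve predictive accuracy and reduce the risk of overfitting.

Compared with a single decision tree, a random forest offers the following advantages:

Higher accuracy (Bagging reduces variance)
Greater robustness (less sensitive to outliers and noise)
Usable interpretability (feature importance scores, although a forest is harder to inspect than a single tree)
Broad applicability (classification, regression, feature selection, and more)

Next, we build up random forests step by step, starting from scratch.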

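2. Core principles of random forests

2.1 Decision trees (the building block)

A random forest is made up of many decision trees, each serving as a base learner.
How a decision tree is built:

Split the samples on a feature
Choose the best split (information gain / Gini impurity)
Recurse until a stopping condition is met (for example, a maximum depth or a pure node)

Diagram:

Feature X1?
 ├── yes → Feature X2?
 │       ├── yes → Class A
 │       └── no → Class B
 └── no → Class C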
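
The two split criteria named above can be written down in a few lines. The following is a minimal sketch, not scikit-learn's internal implementation; the helper names gini, entropy and information_gain are chosen here for illustration only.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Toy node with two balanced classes and a split that separates them perfectly
parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:3], parent[3:]

print(gini(parent))                           # 0.5
print(information_gain(parent, left, right))  # 1.0
```

A tree learner evaluates candidate splits with a criterion like these and keeps the one that reduces impurity the most.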

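2.2 The Bagging idea (Bootstrap Aggregating)

Random forests build on Bagging and add a second source of randomness:

Sample randomness: each tree is trained on a bootstrap sample, drawn from the training set with replacement.
Feature randomness: at each node split, only a random subset of the features is considered.

As a result the trees differ from one another (decorrelation), so they do not all "think alike".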
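
Both sources of randomness are easy to see with plain NumPy. This is an illustrative sketch only; the sizes (150 samples, 4 features) and the sqrt rule for the subset size are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 150, 4  # assumed sizes, for illustration

# Sample randomness: draw n_samples row indices WITH replacement
boot_idx = rng.integers(0, n_samples, size=n_samples)
n_unique = len(np.unique(boot_idx))

# Feature randomness: at one split, look at a random subset of the features
# (sqrt(n_features) is a common choice for classification)
k = max(1, int(np.sqrt(n_features)))
feat_idx = rng.choice(n_features, size=k, replace=False)

print("unique rows in the bootstrap sample:", n_unique, "of", n_samples)
print("features considered at this split:", feat_idx)
```

Because rows are drawn with replacement, some appear several times and others not at all, which is what makes each tree see a slightly different dataset.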

2.3 Voting

Classification: majority vote across the trees
Regression: average of the trees' predictions
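
The two aggregation rules can be sketched with NumPy. The prediction arrays below are made-up values standing in for the outputs of individual trees.

```python
import numpy as np

# Hypothetical class predictions from 5 trees (rows) for 4 samples (columns)
tree_class_preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
    [0, 1, 2, 1],
])

# Classification: the most frequent label in each column wins
majority = np.array([np.bincount(col).argmax() for col in tree_class_preds.T])
print(majority)  # [0 1 2 1]

# Hypothetical regression predictions from 3 trees for 2 samples
tree_reg_preds = np.array([
    [2.0, 3.0],
    [2.5, 3.5],
    [3.0, 2.5],
])

# Regression: average over the trees (column means are 2.5 and 3.0)
average = tree_reg_preds.mean(axis=0)
print(average)
```

Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so its result can occasionally differ from this sketch.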

2.4 Algorithm flow

Training set → [Bootstrap sample] → Decision tree 1 ──┐
Training set → [Bootstrap sample] → Decision tree 2 ──┤
...                                                   ├─→ Final prediction
Training set → [Bootstrap sample] → Decision tree N ──┘
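
The flow above can be reproduced by hand on top of scikit-learn's DecisionTreeClassifier. This is a teaching sketch under simple assumptions (25 trees, hard majority voting, the Iris data), not how RandomForestClassifier is implemented internally; all variable names are chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Bootstrap sample: row indices drawn with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt" makes each split consider a random feature subset
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# Collect every tree's predictions, shape (n_trees, n_test_samples)
all_preds = np.array([t.predict(X_te) for t in trees])

# Final prediction: majority vote per test sample
forest_pred = np.array(
    [np.bincount(col, minlength=3).argmax() for col in all_preds.T]
)
print("hand-built forest accuracy:", np.mean(forest_pred == y_te))
```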

3. Python in practice

We implement a random forest with scikit-learn.

3.1 Installing dependencies

pip install scikit-learn matplotlib seaborn

3.2 Training a random forest classifier

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the random forest
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

# Predict on the test set
y_pred = rf.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=data.target_names))

Sample output (abridged; exact numbers may vary with the scikit-learn version):

Accuracy: 0.9777
              precision    recall  f1-score
setosa             1.00      1.00      1.00
versicolor         0.95      1.00      0.97
virginica          1.00      0.93      0.97

3.3 Visualizing feature importance

import seaborn as sns

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(8,5))
sns.barplot(x=importances[indices], y=np.array(data.feature_names)[indices])
plt.title("Feature Importance (Random Forest)")
plt.show()

4. Random forest regression

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the California housing dataset (downloaded on first use)
housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_reg = RandomForestRegressor(n_estimators=200, max_depth=10, random_state=42)
rf_reg.fit(X_train, y_train)

y_pred = rf_reg.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))

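5. A deeper look at the underlying principles

5.1 Randomness in the trees

Each tree is trained on a randomly drawn bootstrap sample of the training set
Each node considers only a random subset of the features

→ This keeps the trees in the forest diverse and reduces overfitting.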

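5.2 OOB (Out-of-Bag) estimation

Each tree's bootstrap sample leaves out roughly one third of the training samples (about 36.8%, since (1 - 1/n)^n approaches 1/e)
These out-of-bag samples can be used to estimate model accuracy without a separate validation set (the OOB score)

# Fit on the Iris data from section 3.2; X and y were reassigned to the housing data in section 4
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(data.data, data.target)
print("OOB Score:", rf_oob.oob_score_)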
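
The size of the out-of-bag share can be checked numerically. The sketch below simulates bootstrap draws with NumPy and compares the result with (1 - 1/n)^n and its limit 1/e; the sample size and the number of repetitions are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # assumed training-set size, for illustration

fractions = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)   # one bootstrap sample
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[idx] = False              # rows that were drawn are "in the bag"
    fractions.append(oob_mask.mean())  # share of rows never drawn

print("simulated OOB fraction:", np.mean(fractions))
print("theory (1 - 1/n)^n:    ", (1 - 1 / n) ** n)
print("limit 1/e:             ", np.exp(-1))
```

All three values should land close to 0.368, so "about one third" is a slight underestimate of the share of samples each tree never sees.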

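5.3 Bias-variance trade-off

Single decision tree: low bias, high variance
Random forest: Bagging reduces variance while keeping bias roughly as low as that of the individual trees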

In summary:

Decision tree: low bias, high variance
Random forest: low bias, low variance → better overall performance
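
Why averaging lowers variance, and why decorrelated trees matter, can be illustrated with an idealized simulation. Assume, purely for illustration, that the B tree predictions are identically distributed with variance sigma^2 and pairwise correlation rho; the variance of their average is then rho*sigma^2 + (1 - rho)*sigma^2/B. Real trees only approximate this setting, and the values of B, rho and sigma2 below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho, sigma2 = 50, 0.3, 1.0  # assumed values, for illustration

# Covariance matrix: sigma2 on the diagonal, rho * sigma2 everywhere else
cov = sigma2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))

# Each row is one draw of B correlated "tree predictions"
samples = rng.multivariate_normal(np.zeros(B), cov, size=20_000)

empirical = samples.mean(axis=1).var()
theory = rho * sigma2 + (1 - rho) * sigma2 / B

print("empirical variance of the average:", round(empirical, 3))
print("rho*sigma^2 + (1-rho)*sigma^2/B:  ", round(theory, 3))
```

The second term shrinks as B grows, but the first term stays, which is why lowering the correlation between trees through feature randomness helps beyond simply adding more trees.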

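6. Advanced applications

6.1 Feature selection

A random forest can be used to screen for important features, for example by keeping those whose importance exceeds a threshold (0.1 here):

selected_features = np.array(data.feature_names)[importances > 0.1]
print("Important features:", selected_features)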
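
The same thresholding can also be done with scikit-learn's SelectFromModel, which wraps an estimator and keeps the features whose importance reaches the threshold. This is one possible sketch; it refits its own forest on Iris so that it runs on its own, and the threshold of 0.1 simply mirrors the value used above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Fit a forest inside the selector and keep features with importance >= 0.1
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=0.1,
)
selector.fit(iris.data, iris.target)
X_reduced = selector.transform(iris.data)

# get_support() returns a boolean mask over the original features
kept = [
    name
    for name, keep in zip(iris.feature_names, selector.get_support())
    if keep
]
print("kept features:", kept)
print("shape before/after:", iris.data.shape, X_reduced.shape)
```

Using a selector object has the practical advantage that it can be placed in a Pipeline and applied consistently to new data.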

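6.2 Anomaly detection

The confidence of the predicted probabilities can be used to flag unusual samples: a low maximum class probability marks a prediction the forest is unsure about, which makes the sample a candidate for closer inspection.

# rf was trained on Iris, but X_test now holds housing data (section 4), so rebuild the Iris test split
_, X_test_iris, _, _ = train_test_split(data.data, data.target, test_size=0.3, random_state=42)
proba = rf.predict_proba(X_test_iris)
uncertainty = 1 - np.max(proba, axis=1)
print("Top 5 most uncertain test samples:", np.argsort(uncertainty)[-5:])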

6.3 Hyperparameter tuning (GridSearch)

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None],
    'max_features': ['sqrt', 'log2']
}
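# Tune on the Iris data; X and y were reassigned to the housing data in section 4
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring='accuracy')
grid.fit(data.data, data.target)

print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)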

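7. Summary

This article walked through the random forest algorithm:

Core mechanisms: Bagging, feature randomness, voting
Python in practice: classification, regression, feature selection
Underlying principles: OOB estimation, bias-variance trade-off
Extended applications: hyperparameter tuning, anomaly detection

Random forest is not only a go-to algorithm for getting started in machine learning, but also a baseline model widely used in industry.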