Machine Learning: The Random Forest Algorithm and Its Implementation
Description of the random forest algorithm:
How do we bootstrap the features?
We need a feature_bound parameter: every time a split feature has to be chosen, we shuffle the list of candidate features and pick log(d) of them, and the split is then selected only from that random subset. This is repeated at every feature-split decision.
The original decision tree code picks from the node's list of available feature dimensions:
for feat in self.feats:
Now we modify it to add the randomness:
feat_len = len(self.feats)
# No randomness by default
if feature_bound is None:
    indices = range(0, feat_len)
elif feature_bound == "log":
    # np.random.permutation(n) returns the integers 0..n-1 in shuffled order;
    # keep the first max(1, log2(d)) as this split's candidate features
    # (log2 here is math.log2)
    indices = np.random.permutation(feat_len)[:max(1, int(log2(feat_len)))]
else:
    indices = np.random.permutation(feat_len)[:feature_bound]
tmp_feats = [self.feats[i] for i in indices]
for feat in tmp_feats:
This is in fact a Las Vegas style randomized algorithm: we take one step of a deterministic algorithm and replace it with a random choice.
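The subset-selection branch above can be exercised on its own. Here is a minimal, self-contained sketch; the helper name `pick_feature_subset` is hypothetical, and it assumes only NumPy and Python's `math.log2`:

```python
import numpy as np
from math import log2

def pick_feature_subset(feat_len, feature_bound="log"):
    # Mirror of the branch above: no bound -> all feature indices,
    # "log" -> a random subset of max(1, log2(d)) indices,
    # an integer -> a random subset of that size
    if feature_bound is None:
        return list(range(feat_len))
    if feature_bound == "log":
        k = max(1, int(log2(feat_len)))
    else:
        k = feature_bound
    return np.random.permutation(feat_len)[:k].tolist()

print(pick_feature_subset(16))        # 4 of the 16 feature indices, random order
print(pick_feature_subset(16, 6))     # 6 random feature indices
print(pick_feature_subset(4, None))   # all indices: [0, 1, 2, 3]
```

Because `np.random.permutation` shuffles without replacement, the selected indices are always distinct.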
Implementation of the algorithm:
# Import our own decision tree implementation
from c_CvDTree.Tree import *
# deepcopy is needed below to store an independent copy of each trained tree
from copy import deepcopy
import numpy as np
class RandomForest(ClassifierBase):
    # Dictionary of decision tree models, so we can look them up by name
    _cvd_trees = {
        "id3": ID3Tree,
        "c45": C45Tree,
        "cart": CartTree
    }

    def __init__(self):
        super(RandomForest, self).__init__()
        self._trees = []

    # Helper that returns the most frequent element of an array (majority vote)
    @staticmethod
    def most_appearance(arr):
        u, c = np.unique(arr, return_counts=True)
        return u[np.argmax(c)]

    # Defaults: 10 CART trees, and k = log(d) candidate features per split
    def fit(self, x, y, sample_weight=None, tree="cart", epoch=10, feature_bound="log",
            *args, **kwargs):
        x, y = np.atleast_2d(x), np.array(y)
        n_sample = len(y)
        for _ in range(epoch):
            tmp_tree = RandomForest._cvd_trees[tree](*args, **kwargs)
            # Bootstrap sample: draw n_sample indices with replacement
            _indices = np.random.randint(n_sample, size=n_sample)
            if sample_weight is None:
                _local_weight = None
            else:
                _local_weight = np.asarray(sample_weight)[_indices]
                _local_weight /= _local_weight.sum()
            # Train a tree on the bootstrap sample
            tmp_tree.fit(x[_indices], y[_indices],
                         sample_weight=_local_weight, feature_bound=feature_bound)
            # Append the trained tree to the forest
            self._trees.append(deepcopy(tmp_tree))

    # Combine the individual trees by simple voting: collect each tree's
    # predicted class for every sample (a matrix, since there may be several
    # samples) and output, for each sample, the class that appears most often
    def predict(self, x):
        _matrix = np.array([_tree.predict(x) for _tree in self._trees]).T
        return np.array([RandomForest.most_appearance(rs) for rs in _matrix])
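The voting step in predict can be checked without real trees. The following sketch uses hypothetical constant-prediction "trees" (plain lambdas, not the CART trees from c_CvDTree) purely to exercise the same combination rule; only NumPy is assumed:

```python
import numpy as np

def most_appearance(arr):
    # Majority vote: the most frequent value in arr
    u, c = np.unique(arr, return_counts=True)
    return u[np.argmax(c)]

# Hypothetical stand-ins for trained trees: two always predict class 0,
# one always predicts class 1, so class 0 should win every vote 2-1
trees = [
    lambda x: np.zeros(len(x), dtype=int),
    lambda x: np.zeros(len(x), dtype=int),
    lambda x: np.ones(len(x), dtype=int),
]

x = np.arange(5).reshape(-1, 1)

# A bootstrap sample, drawn the same way as in fit(): with replacement
n = len(x)
indices = np.random.randint(n, size=n)

# Same combination rule as RandomForest.predict: rows = samples, cols = trees
matrix = np.array([tree(x) for tree in trees]).T
pred = np.array([most_appearance(row) for row in matrix])
print(pred)   # -> [0 0 0 0 0]
```

Transposing the stacked predictions gives one row per sample, so the majority vote runs once per sample across all trees.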