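Machine Learning Techniques: Matrix Factorization

Linear Network Hypothesis

We use a recommender system as the motivating application for matrix factorization.

Consider a movie rating prediction problem. The dataset has the form: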

$\left\{ \left( \tilde { \mathbf { x } } _ { n } = ( n ) , y _ { n } = r _ { n m } \right) : \text { user } n \text { rated movie } m \right\}$

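where $\tilde { \mathbf { x } } _ { n } = (n)$ is an abstract categorical feature.

What is a categorical feature? Examples include ID numbers, blood type (A, B, AB, O), and programming language (C++, Python, Java).

Most machine learning algorithms operate on numerical features, decision trees being a notable exception. Categorical features therefore need to be converted (encoded) into numerical ones. Here the feature to convert is the user ID, and the tool is binary vector encoding, in which every element of the vector takes one of two values. We use a 0/1 encoding: the element at position ID is 1 and all other elements are 0.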
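As a small illustration of the 0/1 encoding described above, here is a minimal sketch. The function name `binary_vector_encoding` and the choice of 1-based IDs are assumptions made for this example, not something the original text fixes.

```python
import numpy as np

def binary_vector_encoding(user_id: int, num_users: int) -> np.ndarray:
    """Return a 0/1 vector of length num_users with a single 1 at position user_id.

    IDs are assumed to be 1-based here (an assumption for this sketch).
    """
    if not 1 <= user_id <= num_users:
        raise ValueError("user_id must be in 1..num_users")
    x = np.zeros(num_users)
    x[user_id - 1] = 1.0
    return x

x3 = binary_vector_encoding(3, num_users=5)
print(x3)  # [0. 0. 1. 0. 0.]
```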

The encoded dataset for the $m$-th movie, $\mathcal D_m$, can then be written as:

$\left\{ \left( \mathbf { x } _ { n } = \text { BinaryVectorEncoding } ( n ) , y _ { n } = r _ { n m } \right) : \text { user } n \text { rated movie } m \right\}$

Combining the data of all movies gives the dataset $\mathcal D$:

$\left\{ \left( \mathbf { x } _ { n } = \text { BinaryVectorEncoding } ( n ) , \mathbf { y } _ { n } = \left[ \begin{array} { l l l l l l l } r _ { n 1 } & ? & ? & r _ { n 4 } & r _ { n 5 } & \ldots & r _ { n M } \end{array} \right] ^ { T } \right) \right\}$

where $?$ marks a movie that the user has not rated.

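The idea now is to use an $N - \tilde { d } - M$ neural network for feature extraction: $N$ inputs (one per user), $\tilde { d }$ hidden units, and $M$ outputs (one per movie).

For now we use a linear activation function, which gives a linear neural network.

[Figure: structure of the linear $N - \tilde { d } - M$ network]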

Basic Matrix Factorization

Now rename the weight matrices:

$\mathrm { V } ^ { T } \text { for } \left[ w _ { n i } ^ { ( 1 ) } \right] \text { and } \mathrm { W } \text { for } \left[ w _ { i m } ^ { ( 2 ) } \right]$

The hypothesis can then be written as:

$\mathrm { h } ( \mathrm { x } ) = \mathrm { W } ^ { T } \underbrace { \mathrm { Vx } } _ { \Phi ( \mathrm { x } ) }$

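The matrix $\mathrm { V }$ is in effect the feature transform $\Phi ( \mathrm { x } )$, and $\mathrm { W }$ then implements a linear model on the transformed data. Because $\mathbf { x } _ { n }$ is a 0/1 vector with a single 1 at position $n$, the product $\mathrm { V } \mathbf { x } _ { n }$ simply picks out one column of $\mathrm { V }$, so the hypothesis for the $n$-th user can be written as: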

$\mathrm { h } \left( \mathrm { x } _ { n } \right) = \mathrm { W } ^ { T } \mathbf { v } _ { n } , \text { where } \mathbf { v } _ { n } \text { is } n \text { -th column of } \mathrm { V }$
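The column-selection step can be checked numerically. The sketch below uses small, arbitrary sizes and random matrices (all hypothetical) and stores $\mathbf{v}_n$ and $\mathbf{w}_m$ as columns of `V` and `W`; it is only meant to show that $\mathrm{W}^T \mathrm{V} \mathbf{x}_n$ and $\mathrm{W}^T \mathbf{v}_n$ agree.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 4, 3, 2            # users, movies, hidden dimension (hypothetical sizes)
V = rng.normal(size=(d, N))  # column n is the user feature vector v_n
W = rng.normal(size=(d, M))  # column m is the movie weight vector w_m

n = 2                        # 0-based user index
x_n = np.zeros(N)
x_n[n] = 1.0                 # binary vector encoding of user n

h_full = W.T @ (V @ x_n)     # h(x_n) = W^T V x_n
h_short = W.T @ V[:, n]      # h(x_n) = W^T v_n
print(np.allclose(h_full, h_short))   # True

# Stacking the predictions of all users gives an N x M matrix V^T W.
R_hat = V.T @ W
print(np.allclose(R_hat[n], h_full))  # True
```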

The hypothesis for the $m$-th movie can be written as:

$h _ { m } ( \mathbf { x } ) = \mathbf { w } _ { m } ^ { T } \mathbf { \Phi } ( \mathbf { x } )$

For the recommender system, the task is now to find the optimal $\mathrm { W }$ and $\mathrm { V }$.

Ideally, $\mathrm { W }$ and $\mathrm { V }$ satisfy:

$r _ { n m } = y _ { n } \approx \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n }= \mathbf { v } _ { n } ^ { T } \mathbf { w } _ { m } \Longleftrightarrow \mathbf { R } \approx \mathbf { V } ^ { T } \mathbf { W }$

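That is, the product of the (transposed) feature transform matrix $\mathbf { V }$ and the linear model matrix $\mathbf { W }$ approximates the rating matrix.

Recall the rating prediction illustration from Machine Learning Foundations: the viewer and the movie each have a feature vector, and the rating is predicted from the similarity (inner product) of the two vectors. Here the viewer and movie vectors are $\mathbf { v } _ { n }$ and $\mathbf { w } _ { m }$.

For the dataset $\mathcal D$, the squared-error measure of this hypothesis is: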

$E _ { \mathrm { in } } \left( \left\{ \mathbf { w } _ { m } \right\} , \left\{ \mathbf { v } _ { n } \right\} \right) = \frac { 1 } { \sum _ { m = 1 } ^ { M } \left| \mathcal { D } _ { m } \right| } \sum _ { \text {user } n \text { rated movie } m } \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) ^ { 2 }$
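To make the error measure concrete, here is a minimal sketch that averages the squared residual over known ratings only. Representing the known ratings as a list of `(n, m, r_nm)` triples with 0-based indices, and storing the vectors as columns of `V` and `W`, are assumptions of this example.

```python
import numpy as np

def e_in(ratings, V, W):
    """Mean squared error over known ratings only.

    ratings: list of (n, m, r_nm) triples with 0-based indices (assumed format)
    V: (d, N) array, column n is v_n;  W: (d, M) array, column m is w_m
    """
    total = 0.0
    for n, m, r in ratings:
        total += (r - W[:, m] @ V[:, n]) ** 2
    return total / len(ratings)

V = np.array([[1.0, 0.0], [0.0, 1.0]])   # two users, d = 2
W = np.array([[3.0, 1.0], [2.0, 4.0]])   # two movies
ratings = [(0, 0, 3.0), (1, 1, 5.0)]     # user 0 rated movie 0; user 1 rated movie 1
print(e_in(ratings, V, W))               # 0.5
```

Unrated entries (the $?$ marks) never enter the sum, which is why the normalizer is the number of known ratings rather than $N \cdot M$.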

The task is now to learn $\mathbf { v } _ { n }$ and $\mathbf { w } _ { m }$ from the dataset $\mathcal D$ so that this error is minimized.

$\begin{aligned} \min _ { \mathrm { W } , \mathrm { V } } E _ { \mathrm { in } } \left( \left\{ \mathbf { w } _ { m } \right\} , \left\{ \mathbf { v } _ { n } \right\} \right) & \propto \sum _ { \text {user } n \text { rated movie } m } \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) ^ { 2 } \\ & = \sum _ { m = 1 } ^ { M } \left( \sum _ { \left( \mathbf { x } _ { n } , r _ { n m } \right) \in \mathcal { D } _ { m } } \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) ^ { 2 } \right) \quad (1) \\ & = \sum _ { n = 1 } ^ { N } \left( \sum _ { m : \text { user } n \text { rated movie } m } \left( r _ { n m } - \mathbf { v } _ { n } ^ { T } \mathbf { w } _ { m } \right) ^ { 2 } \right) \quad (2) \end{aligned}$

Because the expression involves two sets of variables, $\mathbf { v } _ { n }$ and $\mathbf { w } _ { m }$, optimizing both at once can be difficult. The basic idea is therefore alternating minimization:

Fix $\mathbf { v } _ { n }$, that is, fix the user feature vectors, and solve for each $\mathbf { w } _ { m } \equiv \text { minimize } E _ { \text {in } } \text { within } \mathcal { D } _ { m }$, as grouped in (1).
Fix $\mathbf { w } _ { m }$, that is, fix the movie feature vectors, and solve for each $\mathbf { v } _ { n }$ by minimizing $E _ { \text {in } }$ within the ratings given by user $n$, as grouped in (2).

This procedure is called the alternating least squares algorithm. Concretely:

$\begin{array} { l } \text { initialize } \tilde { d } \text { dimension vectors } \left\{ \mathbf { w } _ { m } \right\} , \left\{ \mathbf { v } _ { n } \right\} \\ \text { alternating optimization of } E _ { \text {in } } : \text { repeatedly } \\ \qquad \text { optimize } \mathbf { w } _ { 1 } , \mathbf { w } _ { 2 } , \ldots , \mathbf { w } _ { M } \text { : } \text { update } \mathbf { w } _ { m } \text { by } m \text { -th-movie linear regression on } \left\{ \left( \mathbf { v } _ { n } , r _ { n m } \right) \right\} \\ \qquad \text { optimize } \mathbf { v } _ { 1 } , \mathbf { v } _ { 2 } , \ldots , \mathbf { v } _ { N } \text { : } \text { update } \mathbf { v } _ { n } \text { by } n \text { -th-user linear regression on } \left\{ \left( \mathbf { w } _ { m } , r _ { n m } \right) \right\} \\ \text { until converge } \end{array}$

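The vectors are initialized randomly. Each alternating step can only decrease $E _ { \text {in } }$ or leave it unchanged, and $E _ { \text {in } }$ is bounded below by zero, so the procedure converges. Alternating least squares is rather like the users and the movies dancing a tango.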
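The alternating procedure can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: the function name `als`, the `(n, m, r_nm)` triple format with 0-based indices, the fixed iteration count in place of a convergence test, and the use of `np.linalg.lstsq` for each per-movie or per-user regression are all choices made for this example.

```python
import numpy as np

def als(ratings, N, M, d=2, n_iters=20, seed=0):
    """Alternating least squares on known ratings.

    ratings: list of (n, m, r_nm) triples, 0-based indices (assumed format).
    Returns V (d, N), W (d, M) and the E_in value after each round.
    """
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(d, N))          # random initialization
    W = rng.normal(size=(d, M))
    by_movie = [[(n, r) for n, mm, r in ratings if mm == m] for m in range(M)]
    by_user = [[(m, r) for nn, m, r in ratings if nn == n] for n in range(N)]

    def e_in():
        return np.mean([(r - W[:, m] @ V[:, n]) ** 2 for n, m, r in ratings])

    history = [e_in()]
    for _ in range(n_iters):
        # Fix V: each w_m is a bias-free linear regression on {(v_n, r_nm)}.
        for m, obs in enumerate(by_movie):
            if obs:
                A = np.array([V[:, n] for n, _ in obs])
                y = np.array([r for _, r in obs])
                W[:, m] = np.linalg.lstsq(A, y, rcond=None)[0]
        # Fix W: each v_n is a bias-free linear regression on {(w_m, r_nm)}.
        for n, obs in enumerate(by_user):
            if obs:
                A = np.array([W[:, m] for m, _ in obs])
                y = np.array([r for _, r in obs])
                V[:, n] = np.linalg.lstsq(A, y, rcond=None)[0]
        history.append(e_in())
    return V, W, history

# Toy data: 3 users, 3 movies, 6 known ratings (made up for illustration).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
V, W, history = als(ratings, N=3, M=3)
print(history[0], history[-1])
```

Since every inner regression minimizes its own block of the squared error exactly, the recorded `history` should be non-increasing up to floating-point rounding.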

Linear Autoencoder versus Matrix Factorization

$\begin{array}{c|c|c} &\text{Linear Autoencoder}&\text{Matrix Factorization}\\ \hline \text{goal} &\mathrm { X } \approx \mathrm { W } \left( \mathrm { W } ^ { T } \mathrm { X } \right)&\mathbf { R } \approx \mathbf { V } ^ { T } \mathbf { W }\\ \hline \text{motivation}&\text { special } d - \tilde { d } - d \text { linear NNet }&N - \tilde { d } - M \text { linear NNet }\\ \hline \text{solution} & \text { global optimal at eigenvectors of } \mathrm { X } ^ { T } \mathrm { X } & \text { local optimal via alternating least squares } \\ \hline \text { usefulness}& \text { extract dimension-reduced features } & \text { extract hidden user/movie features } \end{array}$

So the linear autoencoder is a special kind of matrix factorization, applied to the matrix $\mathrm{X}$.

Stochastic Gradient Descent

Besides alternating optimization, another approach is stochastic gradient descent (SGD).

Recall the error measure for matrix factorization:

$E _ { \mathrm { in } } \left( \left\{ \mathbf { w } _ { m } \right\} , \left\{ \mathbf { v } _ { n } \right\} \right) \propto \sum _ { \text {user } n \text { rated movie } m } \underbrace { \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) ^ { 2 } } _ { \text {err(user } n , \text { movie } m , \text { rating } r_{nm} )}$

SGD is efficient and simple, and it extends easily to other error measures.

Since each step optimizes on a single example, first look at the per-example error:

$\operatorname { err } \left( \text {user } n , \text { movie } m , \text { rating } r _ { n m } \right) = \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) ^ { 2 }$

Its gradients are:

$\begin{array} { r l } \nabla _ { \mathbf { v } _ { n } } & \operatorname { err } \left( \text { user } n , \text { movie } m , \text { rating } r _ { n m } \right) = - 2 \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) \mathbf { w } _ { m } \\ \nabla _ { \mathbf { w } _ { m } } & \operatorname { err } \left( \text { user } n , \text { movie } m , \text { rating } r _ { n m } \right) = - 2 \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) \mathbf { v } _ { n } \end{array}$

In other words, only the $\mathbf { v } _ { n }$ and $\mathbf { w } _ { m }$ of the current example are affected; the gradient with respect to every other parameter is zero. In summary:

$\text {per-example gradient } \propto - ( \text { residual } ) ( \text { the other feature vector } )$
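The per-example gradient formulas can be sanity-checked against finite differences. The sketch below is only such a check; the helper names (`err`, `grad_v`, `grad_w`, `numeric_grad`) and the random test vectors are made up for this example.

```python
import numpy as np

def err(v, w, r):
    return (r - w @ v) ** 2

def grad_v(v, w, r):
    return -2.0 * (r - w @ v) * w      # -(2 * residual) * the other vector

def grad_w(v, w, r):
    return -2.0 * (r - w @ v) * v

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
v, w, r = rng.normal(size=3), rng.normal(size=3), 4.0
ok_v = np.allclose(grad_v(v, w, r), numeric_grad(lambda u: err(u, w, r), v), atol=1e-5)
ok_w = np.allclose(grad_w(v, w, r), numeric_grad(lambda u: err(v, u, r), w), atol=1e-5)
print(ok_v, ok_w)  # True True
```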

The practical steps for solving matrix factorization with SGD are:

$\begin{array} { l } \text { initialize } \tilde { d } \text { dimension vectors } \left\{ \mathbf { w } _ { m } \right\} , \left\{ \mathbf { v } _ { n } \right\} \text { randomly } \\ \text{ for } t = 0,1 , \ldots , T \\ \qquad \text { (1) randomly pick } ( n , m ) \text { within all known } r _ { n m } \\ \qquad \text { (2) calculate residual } \tilde { r } _ { n m } = \left( r _ { n m } - \mathbf { w } _ { m } ^ { T } \mathbf { v } _ { n } \right) \\ \qquad \text { (3) SGD-update: } \\ \qquad\qquad\qquad \begin{aligned} \mathbf { v } _ { n } ^ { n e w } & \leftarrow \mathbf { v } _ { n } ^ { o l d } + \eta \cdot \tilde { r } _ { n m } \mathbf { w } _ { m } ^ { o l d } \\ \mathbf { w } _ { m } ^ { n e w } & \leftarrow \mathbf { w } _ { m } ^ { o l d } + \eta \cdot \tilde { r } _ { n m } \mathbf { v } _ { n } ^ { o l d } \end{aligned} \end{array}$
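A minimal NumPy sketch of these steps follows. The function name `sgd_mf`, the `(n, m, r_nm)` triple format, the small-scale random initialization, and the particular values of `eta` and `T` are assumptions for illustration; the constant factor 2 from the gradient is absorbed into `eta`, as in the update rule above.

```python
import numpy as np

def sgd_mf(ratings, N, M, d=2, eta=0.02, T=20000, seed=0):
    """SGD for matrix factorization on a list of (n, m, r_nm) triples."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(d, N))   # column n is v_n
    W = rng.normal(scale=0.1, size=(d, M))   # column m is w_m
    for _ in range(T):
        n, m, r = ratings[rng.integers(len(ratings))]  # (1) pick a known rating
        residual = r - W[:, m] @ V[:, n]               # (2) residual
        v_old = V[:, n].copy()                         # keep old v_n for the w_m update
        V[:, n] += eta * residual * W[:, m]            # (3) SGD update of v_n
        W[:, m] += eta * residual * v_old              #     SGD update of w_m
    return V, W

def mse(ratings, V, W):
    return np.mean([(r - W[:, m] @ V[:, n]) ** 2 for n, m, r in ratings])

# Toy data: 3 users, 3 movies, 6 known ratings (made up for illustration).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
V, W = sgd_mf(ratings, N=3, M=3)
final_err = mse(ratings, V, W)
print(final_err)
```

Copying the old $\mathbf{v}_n$ before updating it keeps both updates based on the old values, matching the $old$/$new$ superscripts in the update rule.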

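Note that SGD optimizes on randomly chosen examples. For time-sensitive tasks, where recent data is more informative, the random selection should be biased toward recent examples, which may give better results. Keeping this in mind makes the algorithm easier to adapt in practice.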
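One possible way to bias the sampling toward recent ratings is to draw each example with a probability that decays with its age. The exponential half-life weighting below, together with the names `recency_probs`, `timestamps`, and `half_life`, is an assumption made for this sketch; the text only says that recent examples should be favored.

```python
import numpy as np

def recency_probs(timestamps, half_life):
    """Sampling probabilities that halve for every `half_life` units of age."""
    t = np.asarray(timestamps, dtype=float)
    age = t.max() - t                     # age relative to the newest rating
    weights = 0.5 ** (age / half_life)
    return weights / weights.sum()

timestamps = [0.0, 10.0, 20.0, 30.0]      # hypothetical rating times
p = recency_probs(timestamps, half_life=10.0)
print(p)                                  # the newest rating gets the largest probability

rng = np.random.default_rng(0)
idx = rng.choice(len(timestamps), p=p)    # index of the rating to use in step (1) of SGD
```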

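Map of Extraction Models

Extraction models: the idea is to embed the feature transform, as hidden variables, into a linear model (or another base model). In other words, besides learning the model itself, we also learn from the data which transform represents the data effectively.

In neural networks and deep learning, the first $L - 1$ layers ( $\text { weights } w _ { i j } ^ { ( \ell ) }$ ) perform the feature transform and the last layer is a linear model ( $\text { weights } w _ { i j } ^ { ( L ) }$ ). So the hidden transforms are learned at the same time as the linear model.

In RBF networks, the last layer is again a linear model ( $\text { weights } \beta _{ m }$ ), and the hidden variables in between (the representative centers, $\text { RBF centers } \mu _ { m }$ ) are also a form of learned features.

In matrix factorization, two sets of features are learned, $\mathbf { w } _ { m }$ and $\mathbf { v } _ { n }$. Either one can be viewed as linear model weights or as a feature vector, depending on whether one takes the point of view of the user or of the movie.

In adaptive and gradient boosting (Adaptive/Gradient Boosting), finding the hypotheses $g_t$ is itself a form of feature learning, and the learned coefficients $\alpha_t$ are the weights of the linear model.

Similarly, in the k-nearest-neighbor algorithm, the top k neighbors act as a feature transform, and the votes $y_n$ of the neighbors act as the weights of a linear model.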

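Map of Extraction Techniques

Neural networks and deep learning use backpropagation (backprop) based on stochastic gradient descent (SGD). A special case is the autoencoder, which keeps the output equal to the input and thereby learns a compressed encoding.

RBF networks use k-means clustering to find the centers.

Matrix factorization can use alternating least squares or stochastic gradient descent (SGD).

Adaptive and gradient boosting (Adaptive/Gradient Boosting) use functional gradient descent to obtain the hypotheses $g_t$.

The k-nearest-neighbor algorithm uses lazy learning: almost nothing is done during training, and at test time predictions are inferred from the stored data.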

Pros and Cons of Extraction Models

The pros and cons of extraction models (Neural Network/Deep Learning, RBF Network, Matrix Factorization) are as follows.

Pros:

easy: reduces human burden in designing features
powerful: if enough hidden variables considered

Cons:

hard: non-convex optimization problems in general
overfitting: needs proper regularization/validation, precisely because the models are so powerful
