Paper Reading Notes: A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

1 Introduction

The update rule of stochastic gradient descent (SGD) is
$$x_{t+1} = x_t - \alpha_t\, G(x_t),$$
where $x \in \mathbb{R}^n$ is the model parameter vector and $G(x_t)$ is a stochastic gradient computed at $x_t$. Training can be accelerated with a cluster of $P$ workers, where worker $p$ computes the update $G^p(x_t)$ and the update rule becomes
$$x_{t+1} = x_t - \alpha_t\,\frac{1}{P}\sum_{p=1}^{P} G^p(x_t).$$
This synchronous scheme is called S-SGD. Ideally it speeds up training by a factor of $P$, but communication overhead prevents this ideal speedup in practice. Gradient sparsification was therefore proposed to reduce the communication cost, and the update becomes
$$a_t^p = \epsilon_t^p + \alpha_t\, G^p(x_t),$$
$$x_{t+1} = x_t - \frac{1}{P}\sum_{p=1}^{P}\mathrm{Top}_k(a_t^p),$$
$$\epsilon_{t+1}^p = a_t^p - \mathrm{Top}_k(a_t^p),$$
where $\mathrm{Top}_k(\cdot)$ keeps only the $k$ largest-magnitude entries of its input and $\epsilon_t^p$ is the residual of values not yet transmitted by worker $p$.
Since the indices of the non-zero entries may differ across workers, the communication complexity of the aggregation is $O(kP)$. A more recent sparsification algorithm, called gTop-$k$, reduces the communication complexity to $O(k\log P)$.
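To make the sparsified update above concrete, here is a minimal single-process simulation of one Top-$k$ S-SGD step with residual accumulation (it also shows the plain S-SGD-style $1/P$ averaging). The helper names (`topk_sparsify`, `sparse_step`) and the NumPy setup are my own illustration under the standard error-feedback bookkeeping, not code from the paper.

```python
import numpy as np

def topk_sparsify(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def sparse_step(x, grads, residuals, lr, k):
    """One Top-k S-SGD step over P workers (simulated in a single process).

    grads[p]     : stochastic gradient G^p(x_t) of worker p
    residuals[p] : local residual eps^p_t of values not yet transmitted
    Each worker would only communicate k values (plus indices), i.e. O(kP) in total.
    """
    P = len(grads)
    sent = []
    for p in range(P):
        acc = residuals[p] + lr * grads[p]   # a^p_t = eps^p_t + alpha_t * G^p(x_t)
        s = topk_sparsify(acc, k)            # transmitted part
        residuals[p] = acc - s               # keep the rest for later iterations
        sent.append(s)
    x_new = x - sum(sent) / P                # aggregate like S-SGD, but sparse
    return x_new, residuals

# Tiny usage example with random data:
n, P, k = 10_000, 4, 100
x = np.zeros(n)
grads = [np.random.randn(n) for _ in range(P)]
residuals = [np.zeros(n) for _ in range(P)]
x, residuals = sparse_step(x, grads, residuals, lr=0.1, k=k)
```

Each worker transmits only $k$ of the $n$ coordinates per iteration, while the skipped values are retained in `residuals` and eventually applied in later steps.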

2 The gTop-$k$ Algorithm

Before presenting the algorithm, we introduce some notation: let $v_t$ denote the local model on each worker and $\epsilon_t^p$ the local gradient residual of worker $p$ at iteration $t$. At every iteration, all workers hold the same model. The gTop-$k$ update proceeds as follows:
$$a_t^p = \epsilon_t^p + \alpha_t\, G^p(v_t),$$
$$\tilde{a}_t^p = \mathrm{Top}_k(a_t^p),$$
$$v_{t+1} = v_t - \frac{1}{P}\,\mathrm{gTop}_k\Big(\sum_{p=1}^{P}\tilde{a}_t^p\Big),$$
$$\epsilon_{t+1}^p = a_t^p - gMask^p \odot \tilde{a}_t^p.$$
Here $thr$ is the $k$-th largest absolute value of the sum of the two sparsified vectors being merged, and a mask vector $gMask^p$ is generated at the same time:
$$gMask^p_i = \begin{cases} 1, & \big|\big[\tilde{a}_t^{p} + \tilde{a}_t^{p'}\big]_i\big| \ge thr, \\ 0, & \text{otherwise}, \end{cases}$$
where $\tilde{a}_t^{p'}$ denotes the sparsified vector received from the paired worker, so $gMask^p$ marks the entries of worker $p$ that survive the global selection.
The full procedure is given in Algorithm 1 of the paper (gTop-$k$ S-SGD).
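To illustrate the global selection and the $O(k\log P)$ communication cost, here is a single-process sketch of the pairwise (tree-style) reduction: each merge sums two sparse vectors and keeps only their Top-$k$, with $thr$ being the $k$-th largest absolute value of the sum, as described above. The helper names and the exact recursion details are my own simplification of the protocol, not the paper's implementation.

```python
import numpy as np

def topk_with_mask(v, k):
    """Return the Top-k sparsification of v together with its 0/1 mask.
    thr is the k-th largest absolute value of v (ties may keep a few extra entries)."""
    thr = np.sort(np.abs(v))[-k]
    mask = (np.abs(v) >= thr).astype(v.dtype)
    return v * mask, mask

def gtopk(sparse_vecs, k):
    """Pairwise (tree-style) reduction of locally sparsified vectors.
    Each round halves the number of active vectors and keeps only the
    Top-k of every pairwise sum, so P workers need about log2(P) rounds,
    each exchanging O(k) values."""
    vecs = list(sparse_vecs)
    while len(vecs) > 1:
        merged = []
        for i in range(0, len(vecs) - 1, 2):
            s, _ = topk_with_mask(vecs[i] + vecs[i + 1], k)
            merged.append(s)
        if len(vecs) % 2 == 1:       # odd worker carries over to the next round
            merged.append(vecs[-1])
        vecs = merged
    return vecs[0]                   # global Top-k result broadcast back to workers

# Example: P = 8 workers, each holding a locally Top-k sparsified vector.
P, n, k = 8, 10_000, 100
local = [topk_with_mask(np.random.randn(n), k)[0] for _ in range(P)]
global_update = gtopk(local, k)      # roughly k non-zero entries
```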
We also define an auxiliary vector $x_t$:
$$x_t = v_t - \frac{1}{P}\sum_{p=1}^{P}\epsilon_t^p,$$
which, by unrolling the update rules above, satisfies
$$x_{t+1} = x_t - \frac{\alpha_t}{P}\sum_{p=1}^{P} G^p(v_t).$$
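A short check that the auxiliary iterate indeed evolves like un-sparsified S-SGD. This assumes the standard error-feedback bookkeeping sketched above: the update actually applied to the model is $g_t = \sum_{p=1}^{P} m_t \odot \tilde{a}_t^p$ for a common global mask $m_t$, and each worker's new residual is everything of $a_t^p$ that was not applied. These exact bookkeeping equations are my assumption about the algorithm, not a quotation from the paper.

$$\begin{aligned}
x_{t+1} &= v_{t+1} - \frac{1}{P}\sum_{p=1}^{P}\epsilon_{t+1}^{p}
         = \Big(v_t - \frac{1}{P}g_t\Big)
           - \frac{1}{P}\sum_{p=1}^{P}\big(a_t^{p} - m_t\odot\tilde{a}_t^{p}\big) \\
        &= v_t - \frac{1}{P}\sum_{p=1}^{P} a_t^{p}
         = v_t - \frac{1}{P}\sum_{p=1}^{P}\big(\epsilon_t^{p} + \alpha_t G^{p}(v_t)\big)
         = x_t - \frac{\alpha_t}{P}\sum_{p=1}^{P} G^{p}(v_t).
\end{aligned}$$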

3 Convergence Analysis

3.1 Definitions and Assumptions

We assume that every worker holds a full copy of the data and that the objective $f$ is non-convex and $L$-smooth, i.e.
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|, \quad \forall\, x, y \in \mathbb{R}^n.$$
The stochastic gradients are unbiased, i.e.
$$\mathbb{E}\big[G^p_t(x)\big] = \nabla f(x),$$
and the second moment of the per-sample gradients is bounded:
$$\mathbb{E}\big[\|G^{p,(i)}_t(x)\|^2\big] \le G^2,$$
where $G^{p,(i)}_t(x)$ denotes the gradient computed on the $i$-th sample of a mini-batch of size $b$; the total batch size over all $P$ workers is $B = Pb$. We set
$$G^p_t(x)=\frac{1}{b}\sum_{i=1}^{b}G^{p,(i)}_t(x).$$
Hence we obtain
$$\mathbb{E}\Big[\frac{1}{P}\sum_{p=1}^{P} G^p_t(x)\Big] = \nabla f(x),$$
$$\mathbb{E}\Big[\Big\|\frac{1}{P}\sum_{p=1}^{P} G^p_t(x) - \nabla f(x)\Big\|^2\Big] \le \frac{G^2}{B}.$$
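Assuming the second display above is the usual $1/B$ variance bound (my reading of the equation images), here is the standard computation behind it; the re-indexing $G^{(j)}_t$ over all $B = Pb$ samples is my notation:

$$\frac{1}{P}\sum_{p=1}^{P} G^{p}_t(x)=\frac{1}{Pb}\sum_{p=1}^{P}\sum_{i=1}^{b} G^{p,(i)}_t(x)=\frac{1}{B}\sum_{j=1}^{B} G^{(j)}_t(x),$$
$$\mathbb{E}\Big\|\frac{1}{B}\sum_{j=1}^{B}\big(G^{(j)}_t(x)-\nabla f(x)\big)\Big\|^2=\frac{1}{B^2}\sum_{j=1}^{B}\mathbb{E}\big\|G^{(j)}_t(x)-\nabla f(x)\big\|^2\le\frac{G^2}{B},$$

since the per-sample gradients are independent and unbiased (so the cross terms vanish) and $\mathbb{E}\|G^{(j)}_t(x)-\nabla f(x)\|^2=\mathbb{E}\|G^{(j)}_t(x)\|^2-\|\nabla f(x)\|^2\le G^2$.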
The main idea of the proof is as follows:

  • First, bound the gap between the non-sparsified (auxiliary) model $x_t$ and the sparsified local model $v_t$; this allows us to bound the expected sum of squared gradients of $f$ and hence ensures convergence.
  • Second, combine the bound on the expected mean squared gradient of $f$ with sufficient conditions on the learning rate to derive the convergence rate.

3.2 Main Results

[Lemma 1: a bound on the sparsification error]
Proof: For any vector $x \in \mathbb{R}^n$ we have
$$\|x - \mathrm{Top}_k(x)\|^2 \le \Big(1 - \frac{k}{n}\Big)\|x\|^2.$$
Combining this with Assumption 1 completes the proof.
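My reading of the inequality above is the standard Top-$k$ contraction fact; here is a quick numerical check of that fact (the helper name `topk_sparsify` is mine, not from the paper):

```python
import numpy as np

def topk_sparsify(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(0)
n, k = 1000, 50
for _ in range(100):
    x = rng.standard_normal(n)
    lhs = np.linalg.norm(x - topk_sparsify(x, k)) ** 2
    rhs = (1 - k / n) * np.linalg.norm(x) ** 2
    assert lhs <= rhs + 1e-9   # ||x - Top_k(x)||^2 <= (1 - k/n) ||x||^2
```

The bound holds deterministically: the $n-k$ smallest squared entries have an average no larger than the average over all $n$ entries.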

[Lemma 2: a bound on the gap $\|v_{t+1}-x_{t+1}\|^2$ between the local model and the auxiliary model]
Proof: We can bound the difference between $v_{t+1}$ and $x_{t+1}$ as follows:
[Equations: a chain of inequalities bounding $\|v_{t+1}-x_{t+1}\|^2$]
The second-to-last inequality follows from $\|\mathbf{a}+\mathbf{b}\|^2 \le (1+\gamma)\|\mathbf{a}\|^2 + (1+\gamma^{-1})\|\mathbf{b}\|^2$. Unrolling the recursion from $i = 0$ to $t$ yields:
[Equation: the unrolled bound on $\|v_{t+1}-x_{t+1}\|^2$]
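The inequality used in the second-to-last step is a standard consequence of Young's inequality; for completeness:

$$\|\mathbf{a}+\mathbf{b}\|^2=\|\mathbf{a}\|^2+2\langle \mathbf{a},\mathbf{b}\rangle+\|\mathbf{b}\|^2\le(1+\gamma)\|\mathbf{a}\|^2+(1+\gamma^{-1})\|\mathbf{b}\|^2,$$

valid for any $\gamma>0$, since $0\le\big\|\sqrt{\gamma}\,\mathbf{a}-\mathbf{b}/\sqrt{\gamma}\big\|^2=\gamma\|\mathbf{a}\|^2-2\langle\mathbf{a},\mathbf{b}\rangle+\gamma^{-1}\|\mathbf{b}\|^2$, i.e. $2\langle\mathbf{a},\mathbf{b}\rangle\le\gamma\|\mathbf{a}\|^2+\gamma^{-1}\|\mathbf{b}\|^2$.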
[Lemma 3]
Proof: Combining Lemma 2 above with Eq. (12), we obtain:
[Equation: the resulting bound]
[Theorem 1: convergence bound for Algorithm 1]
Proof: Since $f$ is $L$-smooth, we have
$$f(x_{t+1}) \le f(x_t) + \big\langle \nabla f(x_t),\, x_{t+1} - x_t \big\rangle + \frac{L}{2}\,\|x_{t+1} - x_t\|^2.$$
Taking the expectation of both sides gives
[Equation: expectation of the inequality above]
Taking the total expectation over $t$, we get
[Equation: the inequality after taking expectation over $t$]
Applying condition (18) to the inequality above gives
[Equation: the bound under condition (18)]
Rearranging the inequality then yields
[Equation: the rearranged inequality]
Using the $L$-smoothness of $f$ again, we obtain
$$\|\nabla f(x_t) - \nabla f(v_t)\|^2 \le L^2\,\|x_t - v_t\|^2.$$
Combining this with Eq. (21) gives
[Equation: the combined bound]
Summing the inequality above over $t = 1, 2, \dots, T$ gives:
[Equation: the summed inequality]
Dividing both sides by the sum of the learning rates,
[Equation: the averaged bound on the expected mean squared gradient]
In condition (18), if $\gamma(1+\eta) < 1$, the condition can be satisfied whether the learning rate is constant or decreasing. To obtain a bound on $\eta$, we can write
[Equation: a bound relating $\eta$ to $k$]
Therefore we can choose $\eta < \frac{k}{d-k}$ to satisfy the inequality above.
Theorem 1 implies that Algorithm 1 converges (the bound tends to 0) when $T$ is sufficiently large, provided that $\alpha_t$ satisfies the following conditions:
[Equation: the conditions on $\alpha_t$]
[Corollary 2: the convergence rate under $\alpha_t = \theta\sqrt{B/T}$]
Proof: We first show that the constant learning rate $\alpha_t = \theta\sqrt{B/T}$ satisfies condition (18). For brevity, write $\alpha = \alpha_t$; then
[Equation: verification that condition (18) holds for constant $\alpha$]
Since $0 < \tau < 1$, we have
[Equation: the bound obtained using $0 < \tau < 1$]
Therefore condition (18) is satisfied with $D = \frac{\alpha\tau}{1-\tau}$. Applying Theorem 1, we obtain:
[Equation: the $O(1/\sqrt{BT})$ convergence bound]
Corollary 2 shows that, with a properly chosen learning rate, gTop-$k$ converges at a rate of $O\big(\frac{1}{\sqrt{BT}}\big)$, the same as mini-batch SGD; it also indicates that when $T$ is sufficiently large, $k$ has little impact on convergence.
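As a sanity check of the $O\big(\frac{1}{\sqrt{BT}}\big)$ rate (a generic mini-batch SGD argument with placeholder constants $c_1, c_2$, not the paper's exact bound): convergence bounds of this type typically have the form

$$\frac{1}{\sum_{t=1}^{T}\alpha_t}\Big(c_1\big(f(x_1)-f^*\big)+c_2\,\frac{G^2}{B}\sum_{t=1}^{T}\alpha_t^2\Big)
=\frac{c_1\big(f(x_1)-f^*\big)}{T\alpha}+c_2\,\frac{G^2}{B}\,\alpha
\quad\text{for constant } \alpha_t=\alpha.$$

Plugging in $\alpha=\theta\sqrt{B/T}$ makes both terms $O\big(\frac{1}{\sqrt{BT}}\big)$, while the sparsification parameter $k$ only enters through constants such as $\gamma$ and $\eta$, which is consistent with its effect fading for large $T$.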