Coursera | Andrew Ng (02-week-2-2.9)—Learning Rate Decay

This series only adds personal study notes and supplementary derivations on top of the original course; corrections and feedback are welcome. After working through Andrew Ng's course, I organized it into text to make review easier. Since I have been studying English, the series is primarily in English, and I suggest readers treat the English as primary and the Chinese as a supplement, to lay the groundwork for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | 网易云课堂


Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79106795


2.9 Learning rate decay (学习率衰减)

(Subtitle source: 网易云课堂)


One of the things that might help speed up your learning algorithm is to slowly reduce your learning rate over time. We call this learning rate decay. Let's see how you can implement this. Let's start with an example of why you might want to implement learning rate decay. Suppose you're implementing mini-batch gradient descent with a reasonably small mini-batch. Maybe a mini-batch has just 64 or 128 examples. Then as you iterate, your steps will be a little bit noisy. They will tend towards this minimum over here, but won't exactly converge. Your algorithm might just end up wandering around and never really converge, because you're using some fixed value for alpha, and there's just some noise in your different mini-batches. But if you were to slowly reduce your learning rate alpha, then during the initial phases, while your learning rate alpha is still large, you can still have relatively fast learning. But then as alpha gets smaller, the steps you take will be slower and smaller. And so you end up oscillating in a tighter region around this minimum, rather than wandering far away, even as training goes on and on. So the intuition behind slowly reducing alpha is that maybe during the initial steps of learning, you could afford to take much bigger steps. But then as learning approaches convergence, having a slower learning rate allows you to take smaller steps.



So here's how you can implement learning rate decay. Recall that one epoch is one pass through the data. So if you have a training set as follows, maybe you break it up into different mini-batches. Then the first pass through the training set is called the first epoch, the second pass is the second epoch, and so on. One thing you could do is set your learning rate alpha to be equal to 1 over 1 plus a parameter, which I'm going to call the decay rate, times the epoch-num, and this is going to be times some initial learning rate alpha_0. Note that the decay rate here becomes another hyper-parameter, which you might need to tune. So here's a concrete example. If you take several epochs, so several passes through your data, with alpha_0 = 0.2 and decay-rate = 1, then during your first epoch alpha will be 1 / (1 + 1 * 1) times alpha_0, so your learning rate will be 0.1. That's just evaluating this formula when the decay-rate is equal to 1 and the epoch-num is 1. On the second epoch, your learning rate decays to 0.067. On the third, 0.05, on the fourth, 0.04, and so on. Feel free to evaluate more of these values yourself, and get a sense that, as a function of your epoch number, your learning rate gradually decreases according to this formula up on top. So if you wish to use learning rate decay, what you can do is try a variety of values of both the hyper-parameter alpha_0 and this decay-rate hyper-parameter, and then try to find the values that work well. Other than this formula for learning rate decay, there are a few other ways that people use.
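As a minimal sketch of the formula above (the names `decayed_learning_rate`, `alpha0`, `decay_rate`, and `epoch_num` are my own labels, not code from the course), this reproduces the schedule from the worked example:

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Hyperbolic decay: alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# Reproduce the worked example: alpha0 = 0.2, decay_rate = 1
alpha0, decay_rate = 0.2, 1.0
for epoch_num in range(1, 5):
    alpha = decayed_learning_rate(alpha0, decay_rate, epoch_num)
    print(f"epoch {epoch_num}: alpha = {alpha:.3f}")
# epoch 1: alpha = 0.100
# epoch 2: alpha = 0.067
# epoch 3: alpha = 0.050
# epoch 4: alpha = 0.040
```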



For example, this is called exponential decay, where alpha is equal to some number less than 1, such as 0.95, raised to the power of epoch-num, times alpha_0. So this will exponentially quickly decay your learning rate. Other formulas that people use are things like alpha equals some constant over the square root of epoch-num, times alpha_0, or some constant k, another hyper-parameter, over the square root of the mini-batch number t, times alpha_0. And sometimes you also see people use a learning rate that decreases in discrete steps, where for some number of steps you have some learning rate, and then after a while you decrease it by one half, after a while by one half again, and so on. So this is a discrete staircase. So far, we've talked about using some formula to govern how alpha, the learning rate, changes over time. One other thing that people sometimes do is manual decay. If you're training just one model at a time, and if your model takes many hours, or even many days, to train, what some people will do is just watch the model as it's training over a large number of days, and then manually say, it looks like the learning rate slowed down, I'm going to decrease alpha a little bit. Of course this works, this manually controlling alpha, really tuning alpha by hand, hour by hour or day by day. It works only if you're training a small number of models, but sometimes people do that as well. So now you have a few more options for how to control the learning rate alpha. Now, in case you're thinking, wow, this is a lot of hyper-parameters, how do I select amongst all these different options? I would say, don't worry about it for now.
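The alternative schedules mentioned above can be written down in the same style. This is a sketch under assumed constants; `base`, `k`, `drop_every`, and `factor` are illustrative hyper-parameter names, not values from the lecture:

```python
import math

def exponential_decay(alpha0, epoch_num, base=0.95):
    """Exponential decay: alpha = base**epoch_num * alpha0, with base < 1."""
    return (base ** epoch_num) * alpha0

def sqrt_epoch_decay(alpha0, epoch_num, k=1.0):
    """alpha = k / sqrt(epoch_num) * alpha0."""
    return k / math.sqrt(epoch_num) * alpha0

def sqrt_minibatch_decay(alpha0, t, k=1.0):
    """alpha = k / sqrt(t) * alpha0, where t is the mini-batch number."""
    return k / math.sqrt(t) * alpha0

def staircase_decay(alpha0, epoch_num, drop_every=10, factor=0.5):
    """Discrete staircase: cut the learning rate by `factor` every `drop_every` epochs."""
    return alpha0 * (factor ** (epoch_num // drop_every))
```

In practice you would pick one of these forms and tune its constants together with alpha_0, which is exactly the hyper-parameter search discussed next.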



Next week, we'll talk more about how to systematically choose hyper-parameters. For me, I would say that learning rate decay is usually lower down on the list of things I try. Setting alpha to just a fixed value and getting that to be well tuned has a huge impact. Learning rate decay does help. Sometimes it can really help speed up training, but it is a little bit lower down my list in terms of the things I would try. Next week, when we talk about hyper-parameter tuning, you'll see more systematic ways to organize all of these hyper-parameters and how to efficiently search amongst them. So that's it for learning rate decay. Finally, I was also going to talk a little bit about local optima and saddle points in neural networks, so you can have a little bit better intuition about the types of optimization problems your optimization algorithm is trying to solve when you're trying to train these neural networks. Let's go on to the next video to see that.



Summary of key points:

Learning rate decay

When we use mini-batch gradient descent to find the minimum of the cost function, if we use a fixed learning rate α, then once the algorithm gets near the minimum, the noise across different mini-batches keeps it from converging exactly, and it keeps oscillating within a relatively large region around the minimum, as shown by the blue curve in the figure below.

But if we apply learning rate decay and gradually reduce the learning rate α, then at the start of training the learning rate is still relatively large and the descent toward the minimum is relatively fast. As α shrinks, the step size also shrinks, and in the end the iterates oscillate within a much smaller region around the minimum, as shown by the green curve in the figure.

[Figure: trajectories of mini-batch gradient descent with a fixed learning rate (blue) vs. with learning rate decay (green)]

Implementing learning rate decay

  • Commonly used:

$\alpha = \dfrac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \, \alpha_0$

  • Exponential decay:

$\alpha = 0.95^{\text{epoch\_num}} \, \alpha_0$

  • Other:

$\alpha = \dfrac{k}{\sqrt{\text{epoch\_num}}} \, \alpha_0$

  • Discrete staircase decay (a different learning rate is used at different stages of training); see the sketch after this list for how a per-epoch schedule plugs into mini-batch gradient descent.
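As a rough illustration of where such a schedule sits in training, here is a minimal sketch of mini-batch gradient descent that recomputes α once per epoch. The names `mini_batches`, `params`, and `grad_fn` are hypothetical placeholders for this sketch, not code from the course:

```python
def train_with_decay(mini_batches, params, grad_fn,
                     alpha0=0.2, decay_rate=1.0, num_epochs=20):
    """Mini-batch gradient descent with per-epoch learning rate decay.

    `mini_batches` is a list of (X, Y) pairs; `grad_fn(params, X, Y)` is a
    hypothetical function returning a dict of gradients keyed like `params`.
    """
    for epoch_num in range(1, num_epochs + 1):
        # Recompute alpha once per epoch: alpha = alpha0 / (1 + decay_rate * epoch_num)
        alpha = alpha0 / (1 + decay_rate * epoch_num)
        for X, Y in mini_batches:
            grads = grad_fn(params, X, Y)
            for key in params:
                params[key] = params[key] - alpha * grads[key]
    return params
```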



PS: You are welcome to scan the QR code and follow the official account 「SelfImprovementLab」! It focuses on deep learning, machine learning, and artificial intelligence, and also runs occasional group check-in activities for early rising, reading, exercise, English, and more.
