Exploding Gradients

Original: A Gentle Introduction to Exploding Gradients in Neural Networks

Chinese translation: 入门 | 一文了解神经网络中的梯度爆炸 (translated by 机器之心)

The English original is kept below; some passages of the Chinese translation are actually harder to follow than the source text. The comment section under the original post also contains Q&A between the author and readers, which is worth reading.


A Gentle Introduction to Exploding Gradients in Neural Networks

Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.

This has the effect of your model being unstable and unable to learn from your training data.

In this post, you will discover the problem of exploding gradients with deep artificial neural networks.

After completing this post, you will know:

  • What exploding gradients are and the problems they cause during training.
  • How to know whether you may have exploding gradients with your network model.
  • How you can fix the exploding gradient problem with your network.

Let’s get started.

Photo by Taro Taylor, some rights reserved.

What Are Exploding Gradients?

An error gradient is the direction and magnitude calculated during the training of a neural network that is used to update the network weights in the right direction and by the right amount.

In deep networks or recurrent neural networks, error gradients can accumulate during an update and result in very large gradients. These in turn result in large updates to the network weights, and in turn, an unstable network. At an extreme, the values of weights can become so large as to overflow and result in NaN values.

The explosion occurs through exponential growth by repeatedly multiplying gradients through the network layers that have values larger than 1.0.
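
To make this concrete, here is a tiny sketch (plain Python, not from the original post) that multiplies a chain of per-layer gradient factors; a factor above 1.0 explodes after a few dozen layers, while a factor below 1.0 shrinks toward zero (the related vanishing gradient problem).

```python
# Toy illustration: the backpropagated gradient behaves roughly like a product
# of per-layer factors. Factors > 1.0 compound into an explosion; factors < 1.0
# shrink the gradient toward zero.
for factor in (1.5, 0.5):
    gradient = 1.0
    for _ in range(50):  # 50 layers (or time steps)
        gradient *= factor
    print(f"factor={factor}: gradient after 50 layers = {gradient:.3e}")

# factor=1.5 -> ~6.4e+08 (explodes); factor=0.5 -> ~8.9e-16 (vanishes)
```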

What Is the Problem with Exploding Gradients?

In deep multilayer Perceptron networks, exploding gradients can result in an unstable network that at best cannot learn from the training data and at worst results in NaN weight values that can no longer be updated.

… exploding gradients can make learning unstable.

— Page 282, Deep Learning, 2016.

In recurrent neural networks, exploding gradients can result in an unstable network that, at worst, is unable to learn from training data at all and, at best, cannot learn over long input sequences of data.

… the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are due to the explosion of the long term components

— On the difficulty of training recurrent neural networks, 2013.

How Do You Know if You Have Exploding Gradients?

There are some subtle signs that you may be suffering from exploding gradients during the training of your network, such as:

  • The model is unable to get traction on your training data (e.g. poor loss).
  • The model is unstable, resulting in large changes in loss from update to update.
  • The model loss goes to NaN during training.

If you have these types of problems, you can dig deeper to see if you have a problem with exploding gradients.

There are some less subtle signs that you can use to confirm that you have exploding gradients; a minimal monitoring sketch for these checks follows the list below.

  • The model weights quickly become very large during training.
  • The model weights go to NaN values during training.
  • The error gradient values are consistently above 1.0 for each node and layer during training.
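
As an illustration of how you might watch for these signs, here is a minimal monitoring sketch (not from the original post, and assuming the TensorFlow-bundled Keras, tf.keras): a custom callback that reports the largest weight magnitude after each epoch and stops training when the loss becomes NaN.

```python
import numpy as np
from tensorflow import keras

class ExplodingGradientMonitor(keras.callbacks.Callback):
    """Warn when weights grow very large or the loss turns into NaN."""

    def __init__(self, weight_threshold=1e3):
        super().__init__()
        self.weight_threshold = weight_threshold

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        loss = logs.get("loss")
        if loss is not None and np.isnan(loss):
            print(f"Epoch {epoch}: loss is NaN - likely exploding gradients.")
            self.model.stop_training = True
            return
        # Largest absolute weight value across all layers.
        max_weight = max(np.abs(w).max() for w in self.model.get_weights())
        print(f"Epoch {epoch}: largest weight magnitude = {max_weight:.3e}")
        if max_weight > self.weight_threshold:
            print("Weights are growing very large; consider gradient clipping "
                  "or weight regularization.")

# Hypothetical usage with an already-compiled model and data:
# model.fit(X, y, epochs=20,
#           callbacks=[ExplodingGradientMonitor(), keras.callbacks.TerminateOnNaN()])
```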

How to Fix Exploding Gradients?

There are many approaches to addressing exploding gradients; this section lists some best practice approaches that you can use.

1. Re-Design the Network Model

In deep neural networks, exploding gradients may be addressed by redesigning the network to have fewer layers.

There may also be some benefit in using a smaller batch size while training the network.

In recurrent neural networks, updating across fewer prior time steps during training, called truncated Backpropagation through time, may reduce the exploding gradient problem.
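
To illustrate the truncated Backpropagation through time idea, here is a minimal sketch (not from the original post; shapes and data are hypothetical placeholders, assuming tf.keras): a long sequence is split into windows of 50 time steps so that gradients only flow across 50 steps, and a smaller batch size is passed to fit().

```python
import numpy as np
from tensorflow import keras

# Hypothetical long univariate sequence of 10,000 time steps.
long_sequence = np.random.randn(10_000, 1)

# Truncate backpropagation through time by training on windows of 50 steps:
# gradients then flow across at most 50 steps instead of 10,000.
window = 50
n_windows = len(long_sequence) // window
X = long_sequence[: n_windows * window].reshape(n_windows, window, 1)
y = np.random.randn(n_windows, 1)  # placeholder targets, for illustration only

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(window, 1)),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")
model.fit(X, y, batch_size=32, epochs=2)  # a smaller batch size can also help
```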

2. Use Rectified Linear Activation

In deep multilayer Perceptron neural networks, exploding gradients can occur given the choice of activation function, such as the historically popular sigmoid and tanh functions.

Exploding gradients can be reduced by using the rectified linear (ReLU) activation function.

Adopting the ReLU activation function is a new best practice for hidden layers.
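
A minimal sketch of this advice (not from the original post; layer sizes and input shape are arbitrary, assuming tf.keras): an MLP whose hidden layers use the ReLU activation, paired with the He weight initialization commonly recommended alongside it.

```python
from tensorflow import keras

# Multilayer Perceptron with ReLU hidden layers instead of sigmoid/tanh.
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal",
                       input_shape=(20,)),        # 20 input features (arbitrary)
    keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(1, activation="sigmoid"),  # e.g. binary classification
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```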

3. Use Long Short-Term Memory Networks

In recurrent neural networks, exploding gradients can occur given the inherent instability in the training of this type of network, e.g. via Backpropagation through time that essentially transforms the recurrent network into a deep multilayer Perceptron neural network.

Exploding gradients can be reduced by using Long Short-Term Memory (LSTM) memory units and perhaps related gated-type neuron structures.

Adopting LSTM memory units is a new best practice for recurrent neural networks for sequence prediction.
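
A minimal sketch of a sequence model built on LSTM memory units in place of a simple recurrent layer (not from the original post; shapes are hypothetical, assuming tf.keras):

```python
from tensorflow import keras

timesteps, features = 100, 8  # hypothetical sequence length and feature count

# LSTM units replace simple recurrent units; their gating helps keep
# backpropagated gradients better behaved over long sequences.
model = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(timesteps, features)),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")
model.summary()
```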

4. Use Gradient Clipping

Exploding gradients can still occur in very deep Multilayer Perceptron networks with a large batch size and LSTMs with very long input sequence lengths.

If exploding gradients are still occurring, you can check for and limit the size of gradients during the training of your network.

This is called gradient clipping.

Dealing with the exploding gradients has a simple but very effective solution: clipping gradients if their norm exceeds a given threshold.

— Section 5.2.4, Vanishing and Exploding Gradients, Neural Network Methods in Natural Language Processing, 2017.

Specifically, the values of the error gradient are checked against a threshold value and clipped or set to that threshold value if the error gradient exceeds the threshold.

To some extent, the exploding gradient problem can be mitigated by gradient clipping (thresholding the values of the gradients before performing a gradient descent step).

— Page 294, Deep Learning, 2016.

In the Keras deep learning library, you can use gradient clipping by setting the clipnorm or clipvalue arguments on your optimizer before training.

Good default values are clipnorm=1.0 and clipvalue=0.5.
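
For example (a minimal sketch assuming tf.keras, using the values suggested above): clipnorm rescales a gradient whenever its L2 norm exceeds the threshold, while clipvalue clips each gradient element to the range [-threshold, threshold].

```python
from tensorflow import keras

# Clip by the L2 norm of each gradient...
opt_norm = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# ...or clip each gradient value element-wise to [-0.5, 0.5].
opt_value = keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)

# Hypothetical usage with an existing model:
# model.compile(loss="mse", optimizer=opt_norm)
```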

5. Use Weight Regularization

Another approach, if exploding gradients are still occurring, is to check the size of network weights and apply a penalty to the network's loss function for large weight values.

This is called weight regularization and often an L1 (absolute weights) or an L2 (squared weights) penalty can be used.

Using an L1 or L2 penalty on the recurrent weights can help with exploding gradients

— On the difficulty of training recurrent neural networks, 2013.

In the Keras deep learning library, you can use weight regularization by setting the kernel_regularizer argument on your layer and using an L1 or L2 regularizer.
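
A minimal sketch (not from the original post; penalty strengths and shapes are arbitrary, assuming tf.keras): an L2 penalty on an LSTM layer's recurrent weights via recurrent_regularizer, following the quoted paper's suggestion, plus a kernel_regularizer on a Dense layer.

```python
from tensorflow import keras
from tensorflow.keras import regularizers

timesteps, features = 100, 8  # hypothetical shapes

model = keras.Sequential([
    # Penalize large recurrent weights (the suggestion from the 2013 paper).
    keras.layers.LSTM(64, input_shape=(timesteps, features),
                      recurrent_regularizer=regularizers.l2(0.01)),
    # Penalize large weights of a fully connected layer via kernel_regularizer.
    keras.layers.Dense(1, kernel_regularizer=regularizers.l2(0.01)),
])
model.compile(loss="mse", optimizer="adam")
model.summary()
```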

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Deep Learning, 2016. (http://amzn.to/2fwdoKR)
  • Neural Network Methods in Natural Language Processing, 2017. (http://amzn.to/2fwTPCn)

Papers

  • On the difficulty of training recurrent neural networks, 2013. (http://proceedings.mlr.press/v28/pascanu13.pdf)
  • Learning long-term dependencies with gradient descent is difficult, 1994. (http://www.dsi.unifi.it/~paolo/ps/tnn-94-gradient.pdf)
  • Understanding the exploding gradient problem, 2012. (https://pdfs.semanticscholar.org/728d/814b92a9d2c6118159bb7d9a4b3dc5eeaaeb.pdf)

Articles

  • Why is it a problem to have exploding gradients in a neural net (especially in an RNN)? (https://www.quora.com/Why-is-it-a-problem-to-have-exploding-gradients-in-a-neural-net-especially-in-an-RNN)
  • How does LSTM help prevent the vanishing (and exploding) gradient problem in a recurrent neural network? (https://www.quora.com/How-does-LSTM-help-prevent-the-vanishing-and-exploding-gradient-problem-in-a-recurrent-neural-network)
  • Rectifier (neural networks) (https://en.wikipedia.org/wiki/Rectifier_(neural_networks))

Keras API

  • Usage of optimizers in the Keras API (https://keras.io/optimizers/)
  • Usage of regularizers in the Keras API (https://keras.io/regularizers/)

Summary

In this post, you discovered the problem of exploding gradients when training deep neural network models.

Specifically, you learned:

  • What exploding gradients are and the problems they cause during training.
  • How to know whether you may have exploding gradients with your network model.
  • How you can fix the exploding gradient problem with your network.


