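I used to be lazy and never read the BN paper. Now that I am job hunting, every interviewer asks about BN, so I have to take the time to understand how it actually works. My advice to anyone applying for algorithm engineer positions: make sure you really master BN. Knowing that it "does standardization" is not enough.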

Batch Normalization (BN) is designed to address the problem of Internal Covariate Shift (ICS).

The paper defines Internal Covariate Shift as
The change in the distribution of network activations due to the change in network parameters during training.

In other words, during training, changes to the network parameters cause the distribution of each layer's outputs to change.

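The term Internal Covariate Shift has two parts: Internal and Covariate Shift.

Covariate Shift. In supervised learning, covariate shift means that the marginal distributions of the training set and the test set differ, i.e. $P_{train}(x) \neq P_{test}(x)$, while the conditional distributions are the same, i.e. $P_{train}(y \mid x) = P_{test}(y \mid x)$. It is typically handled with domain adaptation methods. See https://www.quora.com/What-is-Covariate-shift
Internal. Note that inside a network the output of one layer is the input of the next. For a given intermediate layer, the parameters of the preceding layers keep changing during training, so their output distribution changes, and therefore the input distribution of this layer keeps changing as well.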

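For an intermediate layer, a constantly changing input distribution means the layer must keep adapting to a new distribution, so learning is inefficient.

BN aims to fix the distribution of the inputs to each layer in order to speed up training.

Whitening the inputs of a model is known to make training converge faster. A neural network is a multi-layer structure, so the idea is to whiten the inputs of every layer in the network.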

The relationship between Gradient Descent and Normalization

The relationship between Gradient Descent (GD) and BN falls into two cases:

gradient descent optimization does not take into account the fact that the normalization takes place
gradient descent optimization takes into account the fact that the normalization takes place

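In the first case, GD is unaware of the normalization, and this causes a problem.
Consider a layer with input u and a learned bias b. The normalization first subtracts the mean, that is, $\hat{x} = x - E[x]$, where $x = u + b$.
If GD is unaware of the normalization and ignores the dependence of E[x] on b, it updates $b \leftarrow b + \Delta b$ with $\Delta b \propto -\partial \ell / \partial \hat{x}$, and then $u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b]$. The update to b has no effect at all on the output, so the loss stays fixed while b can grow without bound.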
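
The cancellation in the first case can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the helper `centered_output`, the toy data and the size of the bias shift are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=1000)  # toy layer inputs (assumed data, not from the paper)

def centered_output(u, b):
    """Hypothetical helper: add the bias b, then subtract the mean."""
    x = u + b
    return x - x.mean()

b = 0.5
delta_b = 100.0  # an arbitrarily large change to the bias

out_before = centered_output(u, b)
out_after = centered_output(u, b + delta_b)

# The two outputs agree up to floating-point error: the shift in b
# is removed again by the mean subtraction.
max_diff = np.max(np.abs(out_before - out_after))
print(max_diff)
```

However large `delta_b` is, the output after mean subtraction stays the same, which is why the loss cannot push back against a growing b.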

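In the second case, GD is aware of the normalization. In the example above, once the dependence of E[x] on b is taken into account, the centered output x - E[x] does not depend on b (its derivative with respect to b is 1 - 1 = 0), so the gradient of the loss with respect to b is zero and b is never updated.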
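
The difference between the two cases can be sketched with a finite-difference gradient. This is only an illustration under assumed names: the squared-error loss, the `target` values and both loss functions are hypothetical and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=1000)                 # toy layer inputs (assumed)
target = rng.normal(loc=1.0, size=1000)   # hypothetical regression targets

def loss_aware(b):
    """Loss where the mean is recomputed from x = u + b (case 2)."""
    x = u + b
    return np.mean((x - x.mean() - target) ** 2)

b = 0.5
frozen_mean = (u + b).mean()  # mean treated as a constant (case 1)

def loss_unaware(b):
    """Loss where the dependence of E[x] on b is ignored (case 1)."""
    return np.mean((u + b - frozen_mean - target) ** 2)

def central_diff(f, b, eps=1e-4):
    """Central finite-difference estimate of df/db."""
    return (f(b + eps) - f(b - eps)) / (2 * eps)

grad_aware = central_diff(loss_aware, b)      # approximately zero
grad_unaware = central_diff(loss_unaware, b)  # clearly non-zero
print(grad_aware, grad_unaware)
```

With the mean frozen, the optimizer sees a non-zero gradient and keeps moving b. With the mean recomputed, the gradient vanishes and b stays where it is.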

Normalization over mini-batches

Whitening over the entire dataset is computationally expensive, so BN makes two necessary simplifications:

instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1.
since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.

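For a d-dimensional input x, each dimension is normalized separately. For the k-th dimension, for example:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$

Here both the mean and the variance are computed from the mini-batch.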
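
The per-dimension normalization can be sketched in a few lines of NumPy. The function name `normalize_batch` and the toy batch are assumptions for illustration, and the small constant `eps` under the square root is added here only for numerical stability; it does not appear in the formula as written in these notes.

```python
import numpy as np

def normalize_batch(x, eps=1e-5):
    """Normalize each feature (column) of a mini-batch independently.

    x has shape (batch_size, d). eps is an assumed stability constant.
    """
    mean = x.mean(axis=0)  # per-dimension mini-batch mean
    var = x.var(axis=0)    # per-dimension mini-batch variance
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))  # toy mini-batch, d = 4
x_hat = normalize_batch(x)

# Each column now has mean close to 0 and variance close to 1.
print(x_hat.mean(axis=0))
print(x_hat.var(axis=0))
```

Statistics are taken along `axis=0`, so every one of the d dimensions gets its own mean and variance, and no cross-dimension covariance is computed. That is the first simplification relative to full whitening.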

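Note that simply normalizing the inputs of a layer changes what that layer can represent. To address this, BN introduces two parameters for each activation, one for scale and one for shift. The idea is that this allows the BN transform to represent the identity transformation and preserves the network capacity (this feels similar to ResNet). That is,

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

where $\hat{x}^{(k)}$ is the normalized value of the k-th dimension, and $\gamma^{(k)}$ and $\beta^{(k)}$ are the learned scale and shift parameters.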
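
The claim that scale and shift let the transform represent the identity can be checked with a small NumPy sketch. The function `batch_norm`, the toy data and the stability constant `eps` are assumptions for illustration. It covers only the training-time forward pass over one mini-batch and is not a full BN layer.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x has shape (batch_size, d); gamma and beta have shape (d,).
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))  # toy mini-batch
eps = 1e-5

# Choosing gamma as the batch standard deviation and beta as the batch
# mean undoes the normalization, so the transform becomes the identity.
gamma = np.sqrt(x.var(axis=0) + eps)
beta = x.mean(axis=0)
y = batch_norm(x, gamma, beta, eps=eps)

print(np.allclose(y, x))
```

In a real network gamma and beta are learned with the other parameters, so the network can settle anywhere between fully normalized activations and the original ones.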