Covariate shift and batch normalization

1 Covariate shift: this simply means that the distribution of a learning model's inputs changes from batch to batch, and that the inputs differ between the training and test stages.
For example, suppose two input features range over [0, 10] and [0, 1000]; we usually normalize them to the same scale [0, 1]. This normalization trick has two advantages.
1.1 Speed up convergence
[Figure: gradient descent contours with features on very different scales (left) vs. features on the same scale (right)]
x1 ranges from 0 to 2000 and x2 ranges from 1 to 5. During gradient descent the gradient direction is perpendicular to the contour lines, so features on very different scales force many more steps toward the optimum; compared with the scenario in the right picture, where the features share the same scale, optimization in the left picture is quite slow.
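To make this concrete, here is a minimal sketch (hypothetical numbers, not the exact surface from the figure) that runs gradient descent on a quadratic bowl whose curvatures mimic features with ranges 0 to 2000 and 1 to 5:

```python
# Gradient descent on f(w) = 0.5 * sum(c_i * w_i^2), starting from w = (1, 1).
import numpy as np

def steps_to_converge(curvatures, lr, tol=1e-6, max_steps=1_000_000):
    w = np.ones_like(curvatures, dtype=float)
    for t in range(max_steps):
        if np.abs(w).max() < tol:
            return t
        w -= lr * curvatures * w          # gradient of f is c_i * w_i
    return max_steps                      # hit the cap without converging

# Curvatures mimicking x1 in [0, 2000] and x2 in [1, 5] (hypothetical).
unscaled = np.array([2000.0 ** 2, 5.0 ** 2])
scaled = np.array([1.0, 1.0])             # after rescaling both features to [0, 1]

# The learning rate must stay below the stability limit set by the largest
# curvature, so the small-curvature direction barely moves on each step.
print(steps_to_converge(unscaled, lr=0.9 / unscaled.max()))  # hits the 1e6-step cap
print(steps_to_converge(scaled, lr=0.9 / scaled.max()))      # a handful of steps
```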

1.2 Higher model accuracy
If the scales of different features are very different, the features contribute unequally when computing distance metrics, which means we effectively lose the information carried by the smaller-scale features. Normalization makes the contribution of every feature roughly the same.
The two most commonly used normalization methods are the following.
1.3 min-max normalization
$x_{\text{new}} = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
1.4 zero-mean normalization
$x_{\text{new}} = \dfrac{x - \mu}{\sigma}$
where μ is the mean and σ is the standard deviation of the data.
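A minimal NumPy sketch of both schemes on a toy data matrix (the numbers are made up), normalizing each feature column independently:

```python
import numpy as np

# Toy data: column x1 roughly in [0, 10], column x2 roughly in [0, 1000].
X = np.array([[ 5.0, 200.0],
              [ 2.0, 950.0],
              [ 9.0,  10.0]])

# 1.3 min-max normalization: rescale every feature to [0, 1]
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# 1.4 zero-mean normalization: zero mean and unit standard deviation per feature
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_zscore = (X - mu) / sigma

print(X_minmax)
print(X_zscore)
```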

2 Internal covariate shift and batch normalization
Covariate shift exists not only at the input layer; internal covariate shift also occurs in the hidden layers of the model. We therefore apply batch normalization to the inputs of the hidden layers, so that the internal layers see roughly the same data distribution throughout training.
In fact, when you work with generative adversarial nets (GANs), the generator keeps taking changing noise as input; without batch normalization it is very hard to learn from this noise input, so batch normalization is a must in GANs.
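As a minimal sketch (hypothetical layer sizes, using PyTorch), batch normalization is simply inserted between the hidden layers of a network; whether it goes before or after the activation varies in practice:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalize the 256 hidden activations over the batch
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

x = torch.randn(32, 784)   # a batch of 32 examples
logits = model(x)          # in training mode BatchNorm uses mini-batch statistics
print(logits.shape)        # torch.Size([32, 10])
```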

2.1 How does batch normalization work?
You might want to simply follow the normalization methods mentioned above: after every activation function in layer A, subtract the features' mean, divide by the features' standard deviation, and feed these normalized features to layer B.
However, this naive normalization damages what layer A has learned: the weights feeding layer B are no longer optimal, and gradient descent will simply undo the normalization if doing so helps minimize the loss function.

Consequently, batch normalization adds two trainable parameters to each layer: the normalized output is multiplied by a “standard deviation” parameter γ and shifted by a “mean” parameter β. In other words, batch normalization lets gradient descent do the de-normalization by changing only these two parameters for each activation, instead of losing the stability of the network by changing all the weights.
[Figure: the batch-normalization transform, computing the mini-batch mean and variance, normalizing, then scaling by γ and shifting by β]
where m is the batch size.
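A minimal NumPy sketch of this transform for a mini-batch of activations of shape (m, features); ε is the usual small constant added for numerical stability, and γ, β are the trainable parameters:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift with learned gamma, beta

m, features = 32, 4
x = np.random.randn(m, features) * 5.0 + 3.0          # activations with arbitrary scale
gamma, beta = np.ones(features), np.zeros(features)   # typical initialization
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))   # ≈ 0 and ≈ 1 at this initialization
```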

Finally, we include a few remarks on combining BN and dropout, from Ian Goodfellow.
[Figure: Ian Goodfellow's remarks on batch normalization and dropout]