Covariate shift and batch normalization

1 Covariate shift: this simply means that the distribution of a learning model's inputs changes from batch to batch, and that the inputs differ between the training and test stages.
For example, suppose two input features range over [0, 10] and [0, 1000]; we usually normalize them to the same scale [0, 1]. This normalization trick has two advantages.
1.1 Speed up convergence
[Figure: gradient descent contours with features on very different scales (left) vs. features on the same scale (right)]
x1 ranges from 0 to 2000 and x2 ranges from 1 to 5. During gradient descent the gradient direction is perpendicular to the contour lines, so features on very different scales force many more steps toward the optimum; compared with the scenario in the right picture, where the features share the same scale, optimization in the left picture is quite slow.
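To make this concrete, here is a minimal sketch (hypothetical numbers, not the exact surface from the figure) that runs gradient descent on a quadratic bowl whose curvatures mimic features with ranges 0 to 2000 and 1 to 5:

```python
# Gradient descent on f(w) = 0.5 * sum(c_i * w_i^2), starting from w = (1, 1).
import numpy as np

def steps_to_converge(curvatures, lr, tol=1e-6, max_steps=1_000_000):
    w = np.ones_like(curvatures, dtype=float)
    for t in range(max_steps):
        if np.abs(w).max() < tol:
            return t
        w -= lr * curvatures * w          # gradient of f is c_i * w_i
    return max_steps                      # hit the cap without converging

# Curvatures mimicking x1 in [0, 2000] and x2 in [1, 5] (hypothetical).
unscaled = np.array([2000.0 ** 2, 5.0 ** 2])
scaled = np.array([1.0, 1.0])             # after rescaling both features to [0, 1]

# The learning rate must stay below the stability limit set by the largest
# curvature, so the small-curvature direction barely moves on each step.
print(steps_to_converge(unscaled, lr=0.9 / unscaled.max()))  # hits the 1e6-step cap
print(steps_to_converge(scaled, lr=0.9 / scaled.max()))      # a handful of steps
```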

1.2 Higher model accuracy
If the scales of different features are very different, the features contribute unequally when computing distance metrics, which means we effectively lose the information carried by the smaller-scale features. Normalization makes the contribution of every feature roughly the same.
The two most commonly used normalization methods are the following.
1.3 min-max normalization
$x_{\text{new}} = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
1.4 zero-mean normalization
$x_{\text{new}} = \dfrac{x - \mu}{\sigma}$
where μ is the mean and σ is the standard deviation of the data.
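A minimal NumPy sketch of both schemes on a toy data matrix (the numbers are made up), normalizing each feature column independently:

```python
import numpy as np

# Toy data: column x1 roughly in [0, 10], column x2 roughly in [0, 1000].
X = np.array([[ 5.0, 200.0],
              [ 2.0, 950.0],
              [ 9.0,  10.0]])

# 1.3 min-max normalization: rescale every feature to [0, 1]
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# 1.4 zero-mean normalization: zero mean and unit standard deviation per feature
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_zscore = (X - mu) / sigma

print(X_minmax)
print(X_zscore)
```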

2 Internal covariate shift and batch normalization
Covariate shift exists not only at the input layer; internal covariate shift also occurs in the hidden layers of the model. We therefore apply batch normalization to the inputs of the hidden layers, so that the internal layers see roughly the same data distribution throughout training.
In fact, when you work with generative adversarial nets (GANs), the generator keeps taking changing noise as input; without batch normalization it is very hard to learn from this noise input, so batch normalization is a must in GANs.
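As a minimal sketch (hypothetical layer sizes, using PyTorch), batch normalization is simply inserted between the hidden layers of a network; whether it goes before or after the activation varies in practice:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalize the 256 hidden activations over the batch
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

x = torch.randn(32, 784)   # a batch of 32 examples
logits = model(x)          # in training mode BatchNorm uses mini-batch statistics
print(logits.shape)        # torch.Size([32, 10])
```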

2.1 How does batch normalization work?
You might want to simply follow the normalization methods mentioned above: after every activation function in layer A, subtract the features' mean, divide by the features' standard deviation, and feed these normalized features to layer B.
However, this naive normalization damages what layer A has learned: the weights feeding layer B are no longer optimal, and gradient descent will simply undo the normalization if doing so helps minimize the loss function.

Consequently, batch normalization adds two trainable parameters to each layer: the normalized output is multiplied by a “standard deviation” parameter γ and shifted by a “mean” parameter β. In other words, batch normalization lets gradient descent do the de-normalization by changing only these two parameters for each activation, instead of losing the stability of the network by changing all the weights.
[Figure: the batch-normalization transform, computing the mini-batch mean and variance, normalizing, then scaling by γ and shifting by β]
where m is the batch size.
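A minimal NumPy sketch of this transform for a mini-batch of activations of shape (m, features); ε is the usual small constant added for numerical stability, and γ, β are the trainable parameters:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift with learned gamma, beta

m, features = 32, 4
x = np.random.randn(m, features) * 5.0 + 3.0          # activations with arbitrary scale
gamma, beta = np.ones(features), np.zeros(features)   # typical initialization
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))   # ≈ 0 and ≈ 1 at this initialization
```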

Finally, we include a few remarks on combining BN and dropout, from Ian Goodfellow.
[Figure: Ian Goodfellow's remarks on batch normalization and dropout]