Activation Functions

sigmoid

\sigma(x) = \frac{1}{1 + e^{-x}}

Squashes numbers to the range [0, 1]
Historically popular because it has a nice interpretation as a saturating “firing rate” of a neuron

3 problems:

1. Saturated neurons “kill” the gradients
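To make the saturation problem concrete, here is a minimal sketch that evaluates the sigmoid and its derivative at a few points. The helper names `sigmoid` and `sigmoid_grad` and the sample inputs are illustrative choices, not part of the original notes.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The local gradient peaks at x = 0 and shrinks toward zero on both sides,
# so very little gradient flows back through a saturated unit.
for x in (-10.0, 0.0, 10.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
```

The derivative is at most 0.25 (at x = 0), so even an unsaturated sigmoid scales the upstream gradient down.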


2. Sigmoid outputs are not zero-centered

Consider what happens when the input to a neuron is always positive...

f\left(\sum_{i} w_{i} x_{i} + b\right)

The gradients on w are then always all positive or all negative (this is also why you want zero-mean data)
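The sign argument follows from the chain rule: dL/dw_i = (upstream gradient) * f'(z) * x_i. A small sketch, assuming f is the sigmoid and using made-up inputs, weights, and upstream gradients (all hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_grads(w, x, b, upstream):
    """Gradient of the loss w.r.t. each w_i for f(sum_i w_i x_i + b), f = sigmoid.

    By the chain rule: dL/dw_i = upstream * f'(z) * x_i.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    s = sigmoid(z)
    local = s * (1.0 - s)          # f'(z), always positive for the sigmoid
    return [upstream * local * xi for xi in x]

x = [0.2, 1.5, 0.7]                # hypothetical all-positive inputs
w = [0.5, -0.3, 0.8]               # hypothetical weights
b = 0.1                            # hypothetical bias

grads_up = weight_grads(w, x, b, upstream=+1.0)    # every entry positive
grads_down = weight_grads(w, x, b, upstream=-1.0)  # every entry negative
print(grads_up)
print(grads_down)
```

Because every component shares one sign, a single update can only move all weights up together or all down together, which tends to produce inefficient zig-zag updates.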

3. exp() is a bit compute expensive

tanh


Squashes numbers to the range [-1, 1]

Unlike sigmoid, tanh is zero-centered (nice)

Still kills gradients when saturated :(
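Both properties can be checked numerically. The sketch below uses `math.tanh` from the standard library; the helper `tanh_grad` and the sample points are illustrative.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

# Outputs are symmetric around zero, but the gradient still vanishes
# once |x| is large.
for x in (-5.0, 0.0, 5.0):
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.5f}  grad={tanh_grad(x):.5f}")
```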

ReLU (Rectified Linear Unit)


f(x) = \max(0, x)

Pros:

Does not saturate (in the + region)
Very computationally efficient
Converges much faster than sigmoid/tanh in practice (e.g. 6x)
Actually more biologically plausible than sigmoid

Cons:

Not zero-centered output

An annoyance:

Hint: what is the gradient when x < 0?
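The hint can be worked through with a short sketch of ReLU and its local gradient. The function names are illustrative, and the value returned at exactly x = 0 is a convention assumed here, since the derivative is undefined at that point.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Local gradient of ReLU: 1 for x > 0, 0 for x < 0.

    At x = 0 the derivative is undefined; returning 0 there is a
    convention assumed for this sketch.
    """
    return 1.0 if x > 0 else 0.0

for x in (-3.0, -0.5, 0.5, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.1f}  grad={relu_grad(x):.1f}")
```

For any x < 0 the gradient is exactly zero, so nothing flows back through the unit for that input.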


active ReLU

dead ReLU: a neuron whose input is negative for every data point will never activate, so it will never update

People like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
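A rough sketch of the difference between an active and a dead unit, assuming a single ReLU neuron, an upstream gradient of 1 for every example, and a tiny made-up dataset. The function name, weights, and biases are all hypothetical.

```python
def relu_neuron_weight_grad(w, b, inputs):
    """Sum over all inputs of d relu(w.x + b) / dw, with upstream gradient 1."""
    grad_w = [0.0] * len(w)
    for x in inputs:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        if z > 0:                       # gradient flows only when the unit is active
            for i, xi in enumerate(x):
                grad_w[i] += xi
    return grad_w

data = [[0.5, 1.0], [1.5, 0.2], [0.3, 0.8]]   # hypothetical inputs

# Slightly positive bias: the unit fires on this data and receives gradient.
active_grad = relu_neuron_weight_grad([0.4, 0.6], b=0.01, inputs=data)

# Large negative bias: the pre-activation is negative for every input,
# so the gradient is zero and the weights can never recover.
dead_grad = relu_neuron_weight_grad([0.4, 0.6], b=-5.0, inputs=data)

print(active_grad)
print(dead_grad)
```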


Leaky ReLU

f(x) = \max(0.01x, x)

The only difference from ReLU is that, instead of staying flat in the negative region, it has a small slope there (0.01)

Does not saturate
Computationally efficient
Converges much faster than sigmoid/tanh in practice (e.g. 6x)
Will not “die”
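A minimal sketch of Leaky ReLU and its local gradient. The `slope` parameter name is an illustrative choice, and the `max` form assumes 0 < slope < 1.

```python
def leaky_relu(x, slope=0.01):
    """Leaky ReLU: max(slope * x, x), assuming 0 < slope < 1."""
    return max(slope * x, x)

def leaky_relu_grad(x, slope=0.01):
    """Local gradient: 1 for x > 0, `slope` otherwise (never exactly zero)."""
    return 1.0 if x > 0 else slope

for x in (-2.0, 2.0):
    print(f"x={x:+.1f}  f={leaky_relu(x):+.3f}  grad={leaky_relu_grad(x):.2f}")
```

Since the gradient in the negative region is small but nonzero, a unit that lands there can still receive updates.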

Parametric Rectifier (PReLU)

f(x) = \max(\alpha x, x)

The slope in the negative region is set by the parameter \alpha, which is learned: backprop into \alpha
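To illustrate what "backprop into \alpha" means, the sketch below computes the gradient of f with respect to both x and \alpha and takes one gradient-descent step on \alpha. The starting value of `alpha`, the learning rate, the input, and the upstream gradient are all hypothetical, and the sketch assumes \alpha < 1.

```python
def prelu(x, alpha):
    """PReLU forward pass: max(alpha * x, x), assuming alpha < 1."""
    return max(alpha * x, x)

def prelu_grads(x, alpha):
    """Return (df/dx, df/dalpha) for f(x) = max(alpha * x, x), alpha < 1."""
    if x > 0:
        return 1.0, 0.0      # positive side: identity, alpha plays no role
    return alpha, x          # negative side: f = alpha * x

alpha = 0.25                 # hypothetical starting value
lr = 0.1                     # hypothetical learning rate
x, upstream = -2.0, 1.0      # hypothetical input and upstream gradient

_, dalpha = prelu_grads(x, alpha)
alpha -= lr * upstream * dalpha     # one gradient-descent step on alpha
print(alpha)
```

Only inputs in the negative region contribute to the gradient on \alpha; for x > 0 the output does not depend on it.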

Exponential Linear Units (ELU)

f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (\exp(x) - 1) & \text{if } x \leq 0 \end{cases}

All benefits of ReLU
Closer to zero mean outputs
Instead of a slope in the negative region, ELU has a negative saturation regime; compared with Leaky ReLU, this adds some robustness to noise

You get these more robust deactivation states
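The negative saturation regime can be seen by evaluating ELU at increasingly negative inputs. This is a sketch with \alpha = 1 assumed as the default; the function names are illustrative.

```python
import math

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) for x <= 0."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    """Local gradient: 1 for x > 0, alpha * exp(x) for x <= 0."""
    return 1.0 if x > 0 else alpha * math.exp(x)

# For very negative inputs the output levels off near -alpha instead of
# continuing to decrease the way Leaky ReLU does.
for x in (-20.0, -1.0, 0.0, 2.0):
    print(f"x={x:+.1f}  elu={elu(x):+.5f}  grad={elu_grad(x):.5f}")
```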

Maxout “Neuron”

\max(w_{1}^{T} x + b_{1}, w_{2}^{T} x + b_{2})

Generalizes ReLU and Leaky ReLU
Linear regime! Does not saturate! Does not die!

Problem: doubles the number of parameters per neuron :(
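A small sketch of how a two-piece maxout unit reduces to ReLU or Leaky ReLU for particular parameter choices, and why the parameter count doubles. The helper `dot`, the input, and the weights are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def maxout(x, w1, b1, w2, b2):
    """Two-piece maxout unit: max(w1.x + b1, w2.x + b2)."""
    return max(dot(w1, x) + b1, dot(w2, x) + b2)

x = [1.0, -2.0, 0.5]            # hypothetical input
w, b = [0.3, 0.8, -0.1], 0.2    # hypothetical weights and bias
z = dot(w, x) + b

# ReLU is the special case w1 = 0, b1 = 0.
relu_like = maxout(x, [0.0] * len(x), 0.0, w, b)

# Leaky ReLU is the special case w1 = 0.01 * w, b1 = 0.01 * b.
leaky_like = maxout(x, [0.01 * wi for wi in w], 0.01 * b, w, b)

# Each maxout unit stores two weight vectors and two biases.
params_relu = len(w) + 1
params_maxout = 2 * (len(w) + 1)
print(relu_like, leaky_like, params_relu, params_maxout)
```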

As a rule of thumb, use ReLU: it is the most standard of the options that generally work.

TLDR: In practice:

Use ReLU. Be careful with your learning rates
Try out Leaky ReLU / Maxout / ELU (these are somewhat more experimental)
Try out tanh, but don't expect much
Don't use sigmoid: it is one of the earliest activation functions, and ReLU and its variants have generally worked better since
