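These notes are based on Prof. Hung-yi Lee's (National Taiwan University) **** on GANs. I recommend watching his videos directly (they are hosted on YouTube, so * is required).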

Contents

Improving Generative Adversarial Network

f-GAN

f-divergence
Fenchel Conjugate
Connection with GAN

WGAN

Earth Mover's Distance
K-Lipschitz Function
Back to GAN

Improved WGAN
Conditional GAN

Cycle GAN

Improving Generative Adversarial Network

f-GAN

Summary: besides the $\text{J-S divergence}$, any other f-divergence can be used to measure the difference between two distributions; that is, any f-divergence can be plugged into the GAN framework.

f-divergence

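$P$ and $Q$ are two distributions, and $p(x)$ and $q(x)$ are the probabilities (densities) of sampling $x$ from each of them. The f-divergence is defined as:
$D_f(P\|Q) = \int_x q(x) f\left(\frac{p(x)}{q(x)}\right) dx \geq 0, \quad \text{s.t. } f \text{ is convex},\ f(1)=0$
This quantity measures how different the two distributions are: the larger the value, the greater the difference. It is 0 when $P = Q$ because $f(1)=0$, and it is nonnegative by Jensen's inequality because $f$ is convex.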

KL divergence, reverse KL divergence, and the chi-square divergence are all special cases of the f-divergence:
$\begin{array}{lll} f(x) = x \log x &\to& D_f(P\|Q) = \int_x q(x) \frac{p(x)}{q(x)} \log \frac{p(x)}{q(x)} dx = \int_x p(x) \log \frac{p(x)}{q(x)} dx \ \text{(KL)} \\ f(x) = - \log x &\to& D_f(P\|Q) = \int_x q(x) \left(- \log \frac{p(x)}{q(x)} \right) dx = \int_x q(x) \log \frac{q(x)}{p(x)} dx \ \text{(Reverse KL)} \\ f(x) = (x-1)^2 &\to& D_f(P\|Q) = \int_x q(x) \left( \frac{p(x)}{q(x)} -1 \right)^2 dx = \int_x \frac{(p(x)-q(x))^2}{q(x)} dx \ \text{(Chi Square)} \end{array}$
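To make the table above concrete, here is a minimal sketch that evaluates the discrete analogue of $D_f(P\|Q)$ (a sum instead of an integral) for the three choices of $f$. The two distributions `P` and `Q` are made-up values for illustration, not taken from the lecture.

```python
import math

def f_divergence(p, q, f):
    """Discrete analogue of D_f(P||Q) = sum_x q(x) * f(p(x) / q(x))."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

# The three generator functions from the table (each is convex with f(1) = 0).
def f_kl(u):
    return u * math.log(u)

def f_reverse_kl(u):
    return -math.log(u)

def f_chi_square(u):
    return (u - 1.0) ** 2

# Hypothetical, strictly positive distributions over three outcomes.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

kl = f_divergence(P, Q, f_kl)
reverse_kl = f_divergence(P, Q, f_reverse_kl)
chi_square = f_divergence(P, Q, f_chi_square)

# The textbook formulas, for comparison with the f-divergence form.
kl_direct = sum(px * math.log(px / qx) for px, qx in zip(P, Q))
reverse_kl_direct = sum(qx * math.log(qx / px) for px, qx in zip(P, Q))

print("KL:", kl, "direct:", kl_direct)
print("Reverse KL:", reverse_kl, "direct:", reverse_kl_direct)
print("Chi square:", chi_square)
```

Both forms agree up to floating-point error, and every divergence is 0 when the two arguments are the same distribution.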

Fenchel Conjugate

Every convex function $f$ has a conjugate function $f^*$: $f^*(t) = \displaystyle\max_{x \in dom(f)}\{ xt - f(x)\}=\sup_{x \in dom(f)}(xt - f(x))$, i.e. the supremum of $xt - f(x)$ over $x$. Note that $f^*$ is itself a convex function.

Example:
$\begin{array}{lll} f(x) = x \log x &\to& f^*(t) = \max_{x \in dom(f)} \{ xt - f(x) \} \\ &\to& g(x) =xt-x \log x,\ \text{given $t$, find the $x$ that maximizes $g(x)$} \\ &\to& \text{setting the derivative of $g(x)$ to zero gives: } x=\exp(t-1) \\ &\to& f^*(t) = \exp(t-1) \cdot t - \exp(t-1) \cdot (t-1) = \exp(t-1) \end{array}$
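The closed form $f^*(t) = \exp(t-1)$ can be sanity-checked numerically. The sketch below approximates the supremum by a brute-force grid search over $x$; the grid range and resolution are arbitrary choices of mine, picked so that the maximizer $x=\exp(t-1)$ falls inside the grid for the tested values of $t$.

```python
import math

def f(x):
    return x * math.log(x)

def conjugate_numeric(t, lo=1e-6, hi=50.0, steps=100_000):
    """Approximate f*(t) = sup_x (x*t - f(x)) by a grid search over x in [lo, hi]."""
    best = -math.inf
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        best = max(best, x * t - f(x))
    return best

def conjugate_closed_form(t):
    """The result derived above for f(x) = x log x."""
    return math.exp(t - 1.0)

for t in (-1.0, 0.0, 1.0, 2.0):
    print(t, conjugate_numeric(t), conjugate_closed_form(t))
```

The grid value never exceeds the closed form (it is a maximum over a subset of the domain) and matches it closely.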

Connection with GAN

$\begin{array}{rll} D_f(P\|Q) &=& \int_x q(x) f(\frac{p(x)}{q(x)}) dx \\[.4em] &=& \int_x q(x) \left( \max_{t \in dom(f^*)} \{ \frac{p(x)}{q(x)} t - f^*(t) \} \right) dx \\[.4em] && (\text{assume there is a function $D: x \to t$}) \\[.4em] &\geq& \int_x q(x) \left( \frac{p(x)}{q(x)} D(x) - f^*(D(x)) \right) dx \\[.4em] &=& \int_x p(x) D(x) dx - \int_x q(x) f^*(D(x)) dx \\[.4em] \therefore D_f(P\|Q) &\approx& \max_D \int_x p(x) D(x) dx - \int_x q(x) f^*(D(x)) dx \\[.4em] &=& \max_D \{ E_{x \sim P}[D(x)] - E_{x \sim Q}[f^*(D(x))] \} \end{array}$
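The inequality in this derivation can be checked on a small discrete example. The sketch below uses $f(x) = x \log x$ (so $f^*(t)=\exp(t-1)$) and two made-up distributions. It relies on one step not spelled out above: for this $f$, the inner maximization over $t$ is solved by $t = 1 + \log\frac{p(x)}{q(x)}$, which follows from setting the derivative of $\frac{p(x)}{q(x)} t - \exp(t-1)$ to zero. The other two discriminators are arbitrary values chosen for illustration.

```python
import math

# Hypothetical distributions over three outcomes.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

def f_star(t):
    """Conjugate of f(x) = x log x."""
    return math.exp(t - 1.0)

def lower_bound(D):
    """E_{x~P}[D(x)] - E_{x~Q}[f*(D(x))], with D given as one value per outcome."""
    expect_p = sum(p * d for p, d in zip(P, D))
    expect_q = sum(q * f_star(d) for q, d in zip(Q, D))
    return expect_p - expect_q

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))

# The maximizing discriminator for this f: D(x) = 1 + log(p(x) / q(x)).
D_opt = [1.0 + math.log(p / q) for p, q in zip(P, Q)]
# Two arbitrary, non-optimal discriminators.
D_zero = [0.0, 0.0, 0.0]
D_other = [1.0, 0.5, -0.5]

print("KL(P||Q):       ", kl)
print("bound at D_opt: ", lower_bound(D_opt))
print("bound at D_zero:", lower_bound(D_zero))
print("bound at D_other:", lower_bound(D_other))
```

The bound is tight at `D_opt` and strictly smaller for the other two, which is why maximizing over $D$ recovers the divergence.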

In practice, $P$ and $Q$ are unknown, so the expectations can only be computed from samples drawn from $P$ and $Q$; that is, the sample mean approximates the true expectation. Applied to f-GAN, this gives:
$\begin{array}{rll} D_f(P_{data}\|P_G) &=& \max_D \{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[f^*(D(x))] \} \\ &\Downarrow& \\ G^* &=& \arg\min_G D_f(P_{data}\|P_G) \\ &=& \arg\min_G\max_D \{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[f^*(D(x))] \} \\ &=& \arg\min_G\max_D V(G, D) \\ &\Downarrow& \\ G^* &=& \arg\min_{\theta_G}\max_{\theta_D} V(\theta_G, \theta_D) \end{array}$
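As a rough illustration of replacing expectations with sample means, the sketch below estimates $V(G, D)$ from samples for a fixed $D$. Everything here is a toy assumption of mine: the two discrete distributions stand in for $P_{data}$ and $P_G$, $f(x) = x \log x$ is used so that $f^*(t)=\exp(t-1)$, and $D$ is fixed to $1 + \log\frac{p_{data}(x)}{p_G(x)}$ rather than being a trained network.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

support = [0, 1, 2]
P_data = [0.5, 0.3, 0.2]   # hypothetical data distribution
P_G = [0.2, 0.3, 0.5]      # hypothetical generator distribution

def f_star(t):
    """Conjugate of f(x) = x log x."""
    return math.exp(t - 1.0)

def D(x):
    """Stand-in discriminator; a real f-GAN would learn this with a network."""
    return 1.0 + math.log(P_data[x] / P_G[x])

def estimate_V(n):
    """Sample-mean estimate of E_{x~P_data}[D(x)] - E_{x~P_G}[f*(D(x))]."""
    real = random.choices(support, weights=P_data, k=n)
    fake = random.choices(support, weights=P_G, k=n)
    mean_real = sum(D(x) for x in real) / n
    mean_fake = sum(f_star(D(x)) for x in fake) / n
    return mean_real - mean_fake

exact = sum(p * math.log(p / q) for p, q in zip(P_data, P_G))
estimate = estimate_V(20_000)
print("exact divergence:", exact)
print("sample estimate: ", estimate)
```

With more samples the estimate gets closer to the exact value; in training, the same two sample means are computed on mini-batches.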
f-GAN training procedure: [figure]

WGAN

Earth Mover's Distance

K-Lipschitz Function

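A function $f$ is K-Lipschitz if it satisfies:
$\| f(x_1) - f(x_2) \| \leq K \| x_1 - x_2 \|$
In other words, the "output change" can be at most $K$ times the "input change"; the intent is "do not change too fast". For more detail, see my other post, "Three Common K-Lipschitz Techniques in DL".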
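One way to get a feel for the definition is to measure the largest ratio $|f(x_1) - f(x_2)| / |x_1 - x_2|$ over many random pairs. This is only an empirical probe on an interval I picked, not a proof; the helper name `max_slope` and the three test functions are my own choices.

```python
import math
import random

def max_slope(f, lo, hi, n_pairs=10_000, seed=0):
    """Largest observed |f(x1) - f(x2)| / |x1 - x2| over random pairs in [lo, hi]."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_pairs):
        x1 = rng.uniform(lo, hi)
        x2 = rng.uniform(lo, hi)
        if x1 != x2:
            worst = max(worst, abs(f(x1) - f(x2)) / abs(x1 - x2))
    return worst

# sin is 1-Lipschitz because |cos(x)| <= 1 everywhere.
slope_sin = max_slope(math.sin, -5.0, 5.0)
# 3x is 3-Lipschitz, so it is not 1-Lipschitz.
slope_linear = max_slope(lambda x: 3.0 * x, -5.0, 5.0)
# x^2 has slope up to 10 on [-5, 5] and is not K-Lipschitz on the whole real line.
slope_square = max_slope(lambda x: x * x, -5.0, 5.0)

print("sin:", slope_sin)
print("3x: ", slope_linear)
print("x^2:", slope_square)
```

The observed ratio is a lower bound on the true Lipschitz constant over the interval, so it can show that a function violates a given $K$ but cannot certify that it satisfies one.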

Back to GAN

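$\begin{array}{rll} D_f(P_{data}\|P_G) &=& \max_D \{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[f^*(D(x))] \} \\ &\Downarrow& \\ W(P_{data},P_G) &=& \max_{D \in \text{1-Lipschitz}} \{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)] \} \end{array}$
Compared with $D_f$, the advantage of $W$ is shown on the right of the figure, and the reason for restricting $D$ to be 1-Lipschitz on the left. In essence, the constraint bounds the slope of the green line on the right so that it cannot go to positive or negative infinity.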
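Since the figure is not available here, the following toy example (my own, not taken from the lecture slides) illustrates one commonly cited difference between the two measures. For two point masses on a 1-D grid, the J-S divergence stays at $\log 2$ no matter how far apart they are, while the Earth Mover's Distance grows with the gap. The 1-D distance is computed with the cumulative-distribution formula, which assumes unit spacing between grid cells.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    def kl(a, b):
        # Terms with a_i = 0 contribute nothing to the sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """Earth Mover's Distance on a unit-spaced 1-D grid: sum of |CDF_p - CDF_q|."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(position, size=10):
    """All probability on a single grid cell."""
    return [1.0 if i == position else 0.0 for i in range(size)]

P = point_mass(0)
for d in (1, 3, 6):
    Q = point_mass(d)
    print("gap", d, "JS", js_divergence(P, Q), "W", wasserstein_1d(P, Q))
```

Under these assumptions, moving the generated point mass closer to the data lowers $W$ step by step, whereas the J-S value gives no signal until the two overlap.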

To optimize $W(P_{data},P_G)$ with gradient descent, the papers propose two methods:

Weight clipping:

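Restrict each weight parameter $w$ to the interval $[-c,c]$. Specifically:

After each parameter update, apply
$w=\begin{cases} c &, \text{if }w>c \\ -c &, \text{if }w<-c \\ w &, \text{otherwise} \end{cases}$

Note that weight clipping does not guarantee that $D$ is 1-Lipschitz; it only guarantees that $D$ is K-Lipschitz for some $K$. Nor does it guarantee that the $D$ found actually maximizes the objective. The algorithm is as follows: [figure]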

With WGAN, the discriminator's loss can be used directly to judge how good the model is, which is not possible with the traditional GAN.

Gradient penalty: the method used by Improved WGAN, described below.
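The weight clipping step described above can be sketched in a few lines. This is only the clipping rule applied to a plain list of numbers, not a full training loop; the weight values and the threshold `c=0.01` are hypothetical. As a small illustration of the "K-Lipschitz for some $K$" remark, I also assume a linear critic $D(x) = w \cdot x$, whose Lipschitz constant is $\|w\|_2$, so clipping bounds it by $c\sqrt{n}$ for $n$ weights.

```python
import math

def clip_weights(weights, c):
    """Clamp every weight into [-c, c]; meant to run after each parameter update."""
    return [max(-c, min(c, w)) for w in weights]

# Hypothetical critic weights right after a gradient step.
weights = [0.003, -0.2, 0.05, -0.007, 1.5]
c = 0.01
clipped = clip_weights(weights, c)
print(clipped)  # [0.003, -0.01, 0.01, -0.007, 0.01]

# For a linear critic D(x) = w . x the Lipschitz constant is the L2 norm of w.
# Clipping bounds it by c * sqrt(n), which is some K, not necessarily 1.
lipschitz_constant = math.sqrt(sum(w * w for w in clipped))
upper_bound = c * math.sqrt(len(clipped))
print(lipschitz_constant, "<=", upper_bound)
```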

Improved WGAN

Theoretical basis: a differentiable function is 1-Lipschitz if and only if $\| \nabla_{x}f(x) \| \leq 1$ for every input $x$, that is:
$D \in \text{1-Lipschitz} \iff \| \nabla_x D(x) \| \leq 1 \ \text{for all } x$
Using this result to rewrite the WGAN objective $W$ gives the Improved WGAN objective:
$\begin{array}{lll} W(P_{data}, P_G) &=& \max_{D \in \text{1-Lipschitz}} \{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)] \} \\ &\Downarrow& \\ W(P_{data}, P_G) &\approx& \max_D \{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)] - \lambda \int_x \max(0, \| \nabla_x D(x)\| - 1) dx \} \\ &\approx& \max_D \{ E_{x \sim P_{data}}[D(x)] - E_{x \sim P_G}[D(x)] - \lambda E_{x \sim P_{penalty}} [\max(0, \| \nabla_x D(x)\| - 1)] \} \end{array}$
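With the new objective, the resulting $D$ is no longer guaranteed to satisfy $\| \nabla_x D(x) \| \leq 1$, but it is pushed toward that property. The points of $P_{penalty}$ are obtained as follows:

1. Sample one point from each of $P_{data}$ and $P_G$, giving $x_1$ and $x_2$.
2. Connect the two points to form the segment $l_{x_1x_2}$, then sample a point $x$ from that segment; this point is an element of $P_{penalty}$.
3. Repeat steps 1 and 2 to collect more points $x$ until their number reaches a threshold.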
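The three sampling steps and the penalty term can be sketched as follows. Several things here are my own assumptions: the point on the segment is drawn uniformly, the "real" and "generated" samples are made-up 2-D points, and the critic is a toy function $D(x) = \frac{1}{2}\|x\|^2$ whose gradient is simply $x$, so no autodiff library is needed. A real implementation would differentiate a neural network instead.

```python
import math
import random

def sample_penalty_points(real_samples, fake_samples, n, rng):
    """Draw n points, each on a segment between a real and a generated sample."""
    points = []
    for _ in range(n):                       # step 3: repeat until n points
        x1 = rng.choice(real_samples)        # step 1: one point from P_data
        x2 = rng.choice(fake_samples)        #         and one point from P_G
        eps = rng.random()                   # step 2: position on the segment (uniform, assumed)
        points.append([eps * a + (1.0 - eps) * b for a, b in zip(x1, x2)])
    return points

def grad_D(x):
    """Gradient of the toy critic D(x) = 0.5 * ||x||^2, which is x itself."""
    return list(x)

def gradient_penalty(points):
    """Sample mean of max(0, ||grad_x D(x)|| - 1) over the penalty points."""
    total = 0.0
    for x in points:
        norm = math.sqrt(sum(g * g for g in grad_D(x)))
        total += max(0.0, norm - 1.0)
    return total / len(points)

rng = random.Random(0)
real = [[2.0, 0.0], [0.0, 2.0]]      # hypothetical samples from P_data
fake = [[0.1, 0.1], [-0.1, 0.2]]     # hypothetical samples from P_G
points = sample_penalty_points(real, fake, 1000, rng)
penalty = gradient_penalty(points)
print("penalty term:", penalty)
```

In the objective above, this value would be multiplied by $\lambda$ and subtracted, so a critic whose gradient norm exceeds 1 along these segments is penalized.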

Conditional GAN

Cycle GAN

The top row turns real photos into stylized images; the bottom row turns stylized images back into realistic photos.
