Photo-to-Caricature Translation on Faces in the Wild

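Abstract: This paper proposes a method for translating face photos into caricatures. It designs a dual-path cGAN model with one global discriminator and one patch discriminator. For both the global and the patch discriminator, we propose a new parallel convolution (ParConv) that replaces the standard convolution (Conv) to build parallel convolutional neural networks (ParCNNs). ParCNNs can merge information from the previous layer with the current layer. For the generator, we add three extra losses (real-fake L1 loss, global perceptual similarity loss, and cascade layer loss) that, together with the adversarial loss, constrain the generated output to be consistent with itself and with the target.
The overall architecture is shown in the figure below: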

3.1 Parallel convolution (ParCNN)

Standard convolutional layer of CNNs: uses the convolution operation as a filter to detect local conjunctions of features from the previous layer and map their appearance to a feature map.
Parallel convolutional layer for CNNs: uses the convolution operation to detect local conjunctions of features from the previous layer and then captures global connections of features from the current layer. The structure is shown in Figure 3:

The ParConv function we propose for the i-th feature map can be expressed as:

where m and n are the number of feature maps from the previous layer and the current layer respectively; X represents the feature map of previous layer and $X^*$ represents the feature map of current layer convolved by feature maps of previous layer, while W and $W^*$ are their corresponding weights; f(·) indicates the activation function and we use Leaky ReLU in our experiments.

3.2 Patch discriminator

The original pix2pix uses a patch discriminator. We improve it by building it from ParCNN and by adding a gradient penalty to the loss, so the loss of our patch discriminator $D_p$ is:
$L_p(G,D_p)=L_c(G,D_p)+L_{gp}$
where
$L_c(G,D_p)=E_{x,y \sim P_{data}(x,y)}[\log D_p(x,y)]+E_{x \sim P_{data}(x),z \sim P_{style}(z)}[\log(1-D_p(x,G(x,z)))]$
$L_{gp}=\Lambda E_{\hat{x}\sim P_{\hat{x}}}[(\|\nabla_{\hat{x}}D_p(\hat{x})\|_2-1)^2]$
$\hat{x}=\alpha G(x,z)+(1-\alpha)y$
Here $\hat{x}$ denotes a mixture of the synthesized fake image $G(x,z)$ and the target image $y$, $\alpha$ is a random number between 0 and 1, and $\Lambda = 1.0$.
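
As a rough illustration of the gradient penalty term, the sketch below estimates $L_{gp}$ on flattened images. It is not the paper's implementation: the critic is a hypothetical toy function $D(x)=\tanh(w \cdot x)$ chosen only because its gradient has a closed form, it ignores the conditioning input, and the names `interpolate`, `toy_critic_grad`, and `gradient_penalty` are made up for this example.

```python
import math
import random


def interpolate(fake, real, alpha):
    """Mix a fake and a real sample: x_hat = alpha * G(x, z) + (1 - alpha) * y."""
    return [alpha * f + (1.0 - alpha) * r for f, r in zip(fake, real)]


def toy_critic_grad(x_hat, w):
    """Gradient w.r.t. x of the toy critic D(x) = tanh(w . x) (an assumption,
    standing in for the real patch discriminator D_p)."""
    s = sum(wi * xi for wi, xi in zip(w, x_hat))
    scale = 1.0 - math.tanh(s) ** 2
    return [scale * wi for wi in w]


def gradient_penalty(fakes, reals, w, lam=1.0, rng=random):
    """Monte Carlo estimate of lam * E[(||grad D(x_hat)||_2 - 1)^2]."""
    total = 0.0
    for fake, real in zip(fakes, reals):
        alpha = rng.random()  # random mixing coefficient in [0, 1)
        x_hat = interpolate(fake, real, alpha)
        grad = toy_critic_grad(x_hat, w)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return lam * total / len(fakes)
```

With a real network the gradient would come from automatic differentiation rather than a closed form; the penalty pushes the gradient norm at the mixed samples toward 1.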

3.3 Global discriminator

The loss of the global discriminator $D_g$ is:
$L_g(G,D_g)=L_c(G,D_g)=E_{x,y \sim P_{data}(x,y)}[\log D_g(x,y)]+E_{x \sim P_{data}(x),z \sim P_{style}(z)}[\log(1-D_g(x,G(x,z)))]$
The global discriminator mainly deals with global structural information and provides the generator with the global perceptual similarity loss.
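
The adversarial term $L_c$ has the same form for both discriminators. A minimal sketch of how it could be estimated from a batch of discriminator outputs follows; it assumes the outputs are already probabilities in [0, 1], and the function name `adversarial_term` and the `eps` clamp (added only to avoid `log(0)`) are assumptions of this example rather than details from the paper.

```python
import math


def adversarial_term(d_real, d_fake, eps=1e-12):
    """Estimate E[log D(x, y)] + E[log(1 - D(x, G(x, z)))].

    d_real: discriminator probabilities on real (photo, caricature) pairs.
    d_fake: discriminator probabilities on (photo, generated caricature) pairs.
    """
    real_part = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_part = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_part + fake_part
```

The discriminator tries to make this value large (real pairs scored near 1, fake pairs near 0), while the generator tries to make the second part small.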

3.4 Generator

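We use a U-net [24] with skip connections as the generator, so that low-level and high-level information is shared directly across the network between input and output (as shown in Figure 2). In addition, to synthesize images in different styles, we use one-hot encoding to provide the style info vector z that controls the style.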
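
A one-hot style vector can be built as in the sketch below. The style names in `STYLES` are hypothetical placeholders; the actual set of style categories used for training is not listed in these notes.

```python
def one_hot(style_index, num_styles):
    """Return a one-hot list with a 1.0 at style_index and 0.0 elsewhere."""
    if not 0 <= style_index < num_styles:
        raise ValueError("style_index out of range")
    return [1.0 if i == style_index else 0.0 for i in range(num_styles)]


# Hypothetical style categories, used only for illustration.
STYLES = ["sketch", "oil_painting"]
z = one_hot(STYLES.index("oil_painting"), len(STYLES))
```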

In this paper we introduce three extra losses for G: $L_{l1}$ is the real-fake L1 loss, $L_{gs}$ is the global perceptual similarity loss, and $L_{cl}$ is the cascade layer loss.

Our final objective is therefore:
where γ = 50, σ = 10, and η = 5.

3.4.1 $L_{l1}$

This loss follows pix2pix: $L_{l1}=||y-G(x,z)||_1$. It constrains the synthesized fake image G(x, z) to stay meaningful with respect to the target image y.
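
As a small worked example, the L1 distance between a target image and a generated image can be computed as follows. Images are assumed to be flattened into lists of pixel values, and the sum (rather than the mean, which many implementations use) is taken to match the norm as written above.

```python
def l1_loss(target, generated):
    """Sum of absolute pixel differences, ||y - G(x, z)||_1."""
    if len(target) != len(generated):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(t - g) for t, g in zip(target, generated))
```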

3.4.2 $L_{gs}$

This global perceptual similarity loss constrains the synthesized fake image G(x, z) to be perceptually similar to the target image y: $L_{gs}=||D_g(y)-D_g(G(x,z))||_1$


3.4.3 $L_{cl}$

$L_{cl}$ is inspired by the Cascaded Refinement Network (CRN):
$L_{cl}=\beta \frac{1}{n} \sum_{i=1}^{n}||G(x,z)_{\Phi_i}-y_{\Phi_i}||_1$

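Here $\Phi$ denotes the output feature maps of a trained visual perception network, $n$ is the number of feature maps, and $\beta$ = 6.67.
Like CRN, we use the pre-trained deep CNN VGG-19 to provide the feature maps for the $L_{cl}$ loss. Unlike CRN, which uses both low-level and high-level layers, we use only the high-level feature maps of size 16×16 to compute this loss.
The high layers of a CNN are likely to be more helpful for abstracting high-level visual information.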
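
The cascade layer loss can be sketched as below. The example assumes the VGG-19 feature maps have already been extracted and flattened into lists; the feature extraction itself is not shown, and the function name `cascade_layer_loss` is an assumption of this example.

```python
def cascade_layer_loss(gen_feats, target_feats, beta=6.67):
    """beta * (1/n) * sum_i ||G(x, z)_{Phi_i} - y_{Phi_i}||_1.

    gen_feats / target_feats: lists of n flattened feature maps, taken from
    the generated image and the target image respectively.
    """
    if len(gen_feats) != len(target_feats) or not gen_feats:
        raise ValueError("need the same non-zero number of feature maps")
    n = len(gen_feats)
    total = 0.0
    for g_map, t_map in zip(gen_feats, target_feats):
        total += sum(abs(g - t) for g, t in zip(g_map, t_map))
    return beta * total / n
```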

4 Experiments

4.1 Dataset and training

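IIIT-CFW is a dataset of cartoon faces in the wild. It contains 8,928 annotated cartoon faces of world-famous people from various professions, and it also provides 1,000 real faces of public figures for cross-modal retrieval tasks. However, because its face photos and face caricatures are not paired, it is not suitable for training a photo-to-caricature translation model. We therefore rebuilt a photo-caricature dataset of 390 image pairs, collected by searching the IIIT-CFW dataset and the Internet, and used it as the training set for our experiments.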

4.6 Style control

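Caricatures come in many different styles, such as sketch and oil painting, so it would be useful to control the style of the translation.
To achieve this, we split our paired photo-caricature training data into different caricature style categories and train our model on these categorized images, adding an extra one-hot vector at the bottleneck of the U-net as auxiliary label information for style control.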
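
One common way to inject such a label at a bottleneck is to tile the one-hot vector over the spatial grid and concatenate it as extra channels. The notes do not say how the paper attaches the vector, so the sketch below is only an assumption; the bottleneck is represented as a list of C feature maps, each an H x W nested list, and `append_style_channels` is a hypothetical helper.

```python
def append_style_channels(bottleneck, z):
    """Concatenate one constant H x W map per style entry to the bottleneck.

    bottleneck: list of C feature maps, each a list of H rows of W floats.
    z: one-hot style vector.
    Returns a list of C + len(z) feature maps.
    """
    height = len(bottleneck[0])
    width = len(bottleneck[0][0])
    # Each style entry becomes a map filled with that entry's value.
    style_maps = [[[value] * width for _ in range(height)] for value in z]
    return bottleneck + style_maps
```

Because each style entry is broadcast to every spatial position, the decoder sees the selected style at all locations of the bottleneck feature map.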