Controllable Person Image Synthesis with Attribute-Decomposed GAN (CVPR20)

3. Method Description

In this framework, the pose $P\in\mathbb{R}^{18\times H\times W}$ is represented as an 18-channel keypoint heatmap.
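A minimal sketch of how such a keypoint heatmap is commonly constructed; the Gaussian rendering and `sigma` are assumptions, not specified in the text:

```python
import numpy as np

def pose_to_heatmap(keypoints, h, w, sigma=6.0):
    """Render 18 (x, y) keypoints into an 18-channel pose heatmap P of shape (18, H, W).
    Undetected keypoints (negative coordinates) leave their channel all-zero."""
    heatmap = np.zeros((18, h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for i, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:  # keypoint not detected
            continue
        heatmap[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap
```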

3.1. Generator

The generator takes a source person image $I_s$ and a target pose $P_t$ as input, and outputs the generated image $I_g$.

A common practice is to concatenate $I_s$ and $P_t$ and feed them directly into the generator.
This paper instead encodes $P_t$ and $I_s$ into latent codes separately, via pose encoding and decomposed component encoding, respectively.

3.1.1 Pose encoding

As shown at the top of Fig.2, the target pose $P_t$ is fed into the pose encoder (a stack of 2 down-sampling convolutional layers) to obtain the pose code $C_{pose}$.
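A minimal PyTorch sketch of such a pose encoder; only the "2 down-sampling convolutional layers" structure comes from the text, while the channel widths and kernel sizes are assumptions:

```python
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Maps the 18-channel pose heatmap P_t to the pose code C_pose
    via two down-sampling convolutional layers."""
    def __init__(self, in_ch=18, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, kernel_size=4, stride=2, padding=1),   # H/2 x W/2
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, kernel_size=4, stride=2, padding=1),  # H/4 x W/4
            nn.ReLU(inplace=True),
        )

    def forward(self, pose):
        return self.net(pose)  # C_pose, a spatial feature map
```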

3.1.2 Decomposed component encoding (DCE)

A semantic map $S$ is first extracted from the source person image $I_s$ and represented as a $K$-channel heatmap $M\in\mathbb{R}^{K\times H\times W}$, where $K$ is the number of segmentation classes of the human parser (in the experiments $K=8$: background, hair, face, upper clothes, pants, skirt, arm and leg). Each channel is a binary mask $M_i\in\mathbb{R}^{H\times W}$; element-wise multiplication with $I_s$ yields a decomposed person image, i.e., the segmentation masks decompose the source image into its individual components:

$$I_s^i = I_s \odot M_i \qquad (1)$$

Each $I_s^i$ is then fed into the texture encoder $T_{enc}$ to obtain its style code $C_{sty}^i$:

$$C_{sty}^i = T_{enc}\left( I_s^i \right) \qquad (2)$$

The style codes $C_{sty}^i$ are concatenated to form the full style code $C_{sty}$; see the $\otimes$ operation in Fig.2.
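A minimal sketch of the DCE pipeline of Eqs. (1)-(2); `texture_encoder` stands for $T_{enc}$ and is assumed to map each masked component image to a style code tensor whose dim 1 can be concatenated:

```python
import torch

def decomposed_component_encoding(i_s, masks, texture_encoder):
    """i_s: (B, 3, H, W) source image; masks: (B, K, H, W) binary semantic masks M.
    Returns the full style code C_sty, the concatenation of the K component codes."""
    codes = []
    for k in range(masks.shape[1]):
        i_s_k = i_s * masks[:, k:k + 1]        # Eq.(1): I_s^i = I_s ⊙ M_i
        codes.append(texture_encoder(i_s_k))   # Eq.(2): C_sty^i = T_enc(I_s^i)
    return torch.cat(codes, dim=1)             # full style code C_sty
```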

In contrast to the common solution that directly encodes the entire source person image, this intuitive DCE module decomposes the source person into multiple components and recombines their latent codes to construct the full style code.

On closer thought, DCE is essentially equivalent to a local-patch approach: both isolate different parts of the image and process them independently.

The authors argue that DCE has two benefits:

  1. It speeds up the convergence of the model.
  2. It provides unsupervised attribute separation: the semantic map comes for free from the human parser, requiring no additional annotation.
Fig.3 shows the structure of the texture encoder. It actually contains two encoders, a Learnable Encoder and a VGG Encoder (pretrained on the COCO dataset); this dual-encoder design is called global texture encoding (GTE).
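A rough sketch of the GTE idea, using torchvision's ImageNet-pretrained VGG19 as a stand-in for the paper's COCO-pretrained VGG (an assumption), with the two feature streams fused by concatenation:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class GlobalTextureEncoder(nn.Module):
    """GTE: a learnable encoder plus a frozen, pretrained VGG encoder."""
    def __init__(self, out_ch=256):
        super().__init__()
        self.learnable = nn.Sequential(  # trainable branch, downsamples to H/4
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # ImageNet weights as a stand-in for the COCO-pretrained VGG of the paper
        vgg = vgg19(weights="IMAGENET1K_V1").features[:10]  # up to pool2: 128 ch, H/4
        for p in vgg.parameters():
            p.requires_grad = False  # fixed branch: generic texture features
        self.vgg = vgg
        self.fuse = nn.Conv2d(128 + 128, out_ch, kernel_size=1)  # merge the two streams

    def forward(self, x):
        f = torch.cat([self.learnable(x), self.vgg(x)], dim=1)
        return self.fuse(f)
```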

Fig.4 shows the effect of DCE and GTE.
3.1.3 Texture style transfer

Texture style transfer aims to transfer the texture of the source image onto the target pose; it is the bridge connecting the style code and the pose code.

The transfer network cascades several style blocks; see the yellow box in Fig.2 for their internal details.

For the $t$-th style block, the inputs are the previous feature map $F_{t-1}$ and the full style code $C_{sty}$, and the output feature map $F_t$ is computed in a residual fashion:

$$F_t = \phi_t\left( F_{t-1}, A \right) + F_{t-1} \qquad (3)$$

Here $F_0 = C_{pose}$, and 8 style blocks are used in total.

In Fig.2, $A$ denotes a learned affine transformation that converts the style code into the scale and shift parameters used to perform AdaIN.
The box labeled Fusion in Fig.2 denotes the fusion module, which consists of 3 fully connected layers: the first two select the desired features via linear recombination, and the last one adjusts the dimensionality.
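A minimal sketch of one style block implementing Eq. (3); the layer widths (e.g. `style_dim=2048`) are assumptions, while the fusion module follows the three-FC description above and `affine` plays the role of $A$, projecting the fused style code to per-channel AdaIN scale and shift:

```python
import torch
import torch.nn as nn

def adain(x, scale, shift, eps=1e-5):
    """Adaptive instance normalization: normalize x per channel, then re-style."""
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    return scale[..., None, None] * (x - mean) / std + shift[..., None, None]

class StyleBlock(nn.Module):
    def __init__(self, ch=256, style_dim=2048, fused_dim=256):
        super().__init__()
        self.fusion = nn.Sequential(  # 3 FC layers: recombine features, then resize dims
            nn.Linear(style_dim, style_dim), nn.ReLU(inplace=True),
            nn.Linear(style_dim, style_dim), nn.ReLU(inplace=True),
            nn.Linear(style_dim, fused_dim),
        )
        self.affine = nn.Linear(fused_dim, 2 * ch)  # A: fused code -> AdaIN scale & shift
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, f_prev, c_sty):
        scale, shift = self.affine(self.fusion(c_sty)).chunk(2, dim=1)
        h = self.conv(adain(f_prev, scale, shift))
        return h + f_prev  # Eq.(3): F_t = phi_t(F_{t-1}, A) + F_{t-1}
```

Chaining eight such blocks, starting from $F_0 = C_{pose}$ and feeding the same $C_{sty}$ to each, gives the transfer network.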

3.1.4 Person image reconstruction

The output of the last style block is fed into the decoder to obtain the generated result $I_g$.

3.2. Discriminators

Following [46], two discriminators $D_p$ and $D_t$ are used: $D_p$ drives $I_g$ to match the target pose $P_t$, while $D_t$ drives the texture of $I_g$ to resemble that of $I_s$.

For $D_p$, fake samples are defined as $\left( P_t, I_g \right)$ and real samples as $\left( P_t, I_t \right)$.
Note: in the dataset, the same person wearing the same clothes appears in multiple poses, so $I_t$ is in fact the ground truth.
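A minimal sketch of how these real/fake pairs could be assembled; the text only specifies the pairs for $D_p$, so the analogous $(I_s, \cdot)$ pairing for $D_t$ and the channel-wise concatenation of (condition, image) are assumptions:

```python
import torch

def build_discriminator_pairs(i_s, p_t, i_t, i_g):
    """D_p judges pose consistency, D_t judges texture consistency.
    Each discriminator sees a channel-wise concatenation of (condition, image)."""
    dp_fake = torch.cat([p_t, i_g], dim=1)  # (P_t, I_g): fake for D_p
    dp_real = torch.cat([p_t, i_t], dim=1)  # (P_t, I_t): real for D_p
    dt_fake = torch.cat([i_s, i_g], dim=1)  # (I_s, I_g): fake for D_t (assumed pairing)
    dt_real = torch.cat([i_s, i_t], dim=1)  # (I_s, I_t): real for D_t (assumed pairing)
    return dp_fake, dp_real, dt_fake, dt_real
```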

3.3. Training