PSGAN: Pose and Expression Robust Spatial-Aware GAN for Customizable Makeup Transfer (CVPR20)

3. PSGAN

3.1. Formulation

source image domain $X$, reference image domain $Y$

$\left\{ x^n \right\}_{n=1,\cdots,N},\ x^n \in X$; $\left\{ y^m \right\}_{m=1,\cdots,M},\ y^m \in Y$

The distribution over domain $X$ is $\mathcal{P}_X$, and the distribution over domain $Y$ is $\mathcal{P}_Y$.

The learning objective is a transfer function $G: \left\{ x, y \right\} \rightarrow \tilde{x}$, where $\tilde{x}$ carries the makeup style of $y$ while preserving the identity of $x$.

3.2. Framework

Overall

The framework of PSGAN is shown in Fig. 2; it consists of three components:

  1. Makeup distill network (MDNet): extracts the makeup style from the reference image $y$ as two components $\gamma, \beta$, called the makeup matrices.
  2. Attentive makeup morphing module (AMM module): because the expression and pose may differ greatly between the source image $x$ and the reference image $y$, the AMM module is proposed to morph the two makeup matrices $\gamma, \beta$ into two new matrices $\gamma', \beta'$, which are adaptive to the source image, by considering the similarities between pixels of the source and the reference.
  3. Makeup apply network (MANet): applies $\gamma', \beta'$ to the bottleneck feature map of MANet (see the sketch after this list).
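
Putting the three modules together, the forward pass is roughly as follows. This is a minimal sketch: the `mdnet`, `amm`, and `manet` callables stand in for the modules detailed in the subsections below, and the exact signatures are my assumption, not the paper's API.

```python
def psgan_transfer(x, y, mdnet, amm, manet):
    """Sketch of G: {x, y} -> x_tilde, for x, y: (B, 3, H0, W0) image tensors."""
    v_y, gamma, beta = mdnet(y)                 # distill makeup matrices from the reference
    gamma_p, beta_p = amm(x, v_y, gamma, beta)  # morph into source-adaptive gamma', beta'
    return manet(x, gamma_p, beta_p)            # x_tilde: y's makeup style, x's identity
```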

Makeup distill network (MDNet)

The network structure of MDNet is the encoder-bottleneck part of StarGAN (the bottleneck refers to the residual blocks). It is responsible for extracting the makeup-related features (e.g., lip gloss, eye shadow), which are represented as two makeup matrices $\gamma, \beta$.

As shown in Fig. 2(B), the output of MDNet is a feature map $\mathbf{V}_\mathbf{y} \in \mathbb{R}^{C \times H \times W}$, followed by two parallel 1x1 conv layers that produce $\gamma \in \mathbb{R}^{1 \times H \times W}$ and $\beta \in \mathbb{R}^{1 \times H \times W}$.
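
Below is a minimal PyTorch sketch of this structure. The layer widths, instance normalization, and number of residual blocks follow the StarGAN generator and are assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, 3, 1, 1), nn.InstanceNorm2d(dim, affine=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, 1, 1), nn.InstanceNorm2d(dim, affine=True),
        )

    def forward(self, x):
        return x + self.block(x)

class MDNet(nn.Module):
    """Encoder-bottleneck of StarGAN, plus two parallel 1x1 convs for gamma/beta."""
    def __init__(self, c=256, n_res=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, 1, 3), nn.InstanceNorm2d(64, affine=True), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.InstanceNorm2d(128, affine=True), nn.ReLU(inplace=True),
            nn.Conv2d(128, c, 4, 2, 1), nn.InstanceNorm2d(c, affine=True), nn.ReLU(inplace=True),
            *[ResBlock(c) for _ in range(n_res)],  # the "bottleneck"
        )
        self.to_gamma = nn.Conv2d(c, 1, 1)  # gamma: (B, 1, H, W)
        self.to_beta = nn.Conv2d(c, 1, 1)   # beta:  (B, 1, H, W)

    def forward(self, y):
        v_y = self.encoder(y)               # V_y: (B, C, H, W)
        return v_y, self.to_gamma(v_y), self.to_beta(v_y)
```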

Attentive makeup morphing module (AMM module)

Because the expression and pose may differ greatly between the source image $x$ and the reference image $y$, $\gamma, \beta$ cannot be applied to the source image $x$ directly.
Q: does this mean $\gamma, \beta$ still contain information such as the expression and pose of the reference image $y$?

The AMM module computes an attentive matrix $A \in \mathbb{R}^{HW \times HW}$ to specify how a pixel in the source image $x$ is morphed from the pixels in the reference image $y$, where $A_{i,j}$ indicates the attentive value between the $i$-th pixel $x_i$ in image $x$ and the $j$-th pixel $y_j$ in image $y$.
Intuition: suppose position $i$ in $x$ is at the corner of the eye, and position $j$ in $y$ is also at the corner of the eye; then $A_{i,j}$ should be relatively large, meaning that the pixel at position $i$ of $\tilde{x}$ should draw on the pixel at position $j$ of $y$, which is what enables good eye-shadow transfer.
(One drawback: since $H$ and $W$ are flattened into a single dimension, some spatial information is inevitably lost.)
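
Given $A$, the morphing step itself reduces to a matrix product over the flattened spatial dimension; a minimal sketch (the batch dimension is added for concreteness, and how $A$ is built is covered next):

```python
import torch

def morph_makeup_matrices(A, gamma, beta):
    """Morph reference makeup matrices with the attentive matrix A.

    A:           (B, HW, HW)  attention from source pixels to reference pixels
    gamma, beta: (B, 1, H, W) makeup matrices of the reference
    Returns gamma', beta' of shape (B, 1, H, W), aligned with the source.
    """
    B, _, H, W = gamma.shape
    gamma_flat = gamma.view(B, 1, H * W).transpose(1, 2)  # (B, HW, 1)
    beta_flat = beta.view(B, 1, H * W).transpose(1, 2)    # (B, HW, 1)
    gamma_p = torch.bmm(A, gamma_flat)                    # weighted sum over reference pixels
    beta_p = torch.bmm(A, beta_flat)
    return (gamma_p.transpose(1, 2).view(B, 1, H, W),
            beta_p.transpose(1, 2).view(B, 1, H, W))
```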

68 facial landmarks are introduced as anchor points.
Take the landmark at the tip of the nose as an example: for every position $i$ of $x$, compute the displacement from position $i$ to the nose tip along the x and y axes (values can be positive or negative), giving a 2-dimensional vector; over all 68 landmarks this yields a 136-dimensional vector $\mathbf{p}_i \in \mathbb{R}^{136}, i = 1,\cdots,H \times W$, called the relative position features.
$$
\begin{aligned}
\mathbf{p}_i = \big[\, & f(x_i)-f(l_1),\ f(x_i)-f(l_2),\ \cdots,\ f(x_i)-f(l_{68}), \\
& g(x_i)-g(l_1),\ g(x_i)-g(l_2),\ \cdots,\ g(x_i)-g(l_{68}) \,\big] \qquad(1)
\end{aligned}
$$
where $f(\cdot)$ and $g(\cdot)$ indicate the coordinates on the x and y axes, and $l_i$ indicates the $i$-th facial landmark.
Thought: stacked over all positions, $\mathbf{p}$ should have dimensions $H \times W \times 136$.

Since landmarks are used, differences in face size inevitably exist; therefore $\mathbf{p}$ is normalized to unit length, i.e., $\frac{\mathbf{p}}{\left\| \mathbf{p} \right\|}$.
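
Below is a NumPy sketch of Eq. (1) plus the normalization. The landmark array layout is an assumption, and the per-pixel L2 norm is my reading of $\mathbf{p}/\left\|\mathbf{p}\right\|$.

```python
import numpy as np

def relative_position_features(landmarks, H, W):
    """Relative position features per Eq. (1), unit-normalized per pixel.

    landmarks: (68, 2) array of (x, y) landmark coordinates on the H x W grid.
    Returns p of shape (H, W, 136).
    """
    ys, xs = np.mgrid[0:H, 0:W]                        # per-pixel coordinates
    dx = xs[..., None] - landmarks[None, None, :, 0]   # f(x_i) - f(l_k): (H, W, 68)
    dy = ys[..., None] - landmarks[None, None, :, 1]   # g(x_i) - g(l_k): (H, W, 68)
    p = np.concatenate([dx, dy], axis=-1)              # (H, W, 136)
    norm = np.linalg.norm(p, axis=-1, keepdims=True)   # remove face-size differences
    return p / np.maximum(norm, 1e-8)
```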

Moreover, to avoid unreasonably sampling pixels that have similar relative positions but different semantics, the visual similarities between pixels are also taken into account.
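
One way to realize this, sketched below: concatenate each pixel's visual feature with its relative position feature, take dot products between source and reference pixels, and softmax over the reference pixels. The paper additionally masks the attention so that only pixels from the same facial region (obtained via face parsing) attend to each other; that masking is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def attentive_matrix(v_x, v_y, p_x, p_y):
    """Attention built from visual features plus relative position features.

    v_x, v_y: (B, C, H, W)   visual feature maps of source / reference
    p_x, p_y: (B, 136, H, W) relative position features of source / reference
    Returns A: (B, HW, HW), each row a softmax over reference pixels.
    """
    B, _, H, W = v_x.shape
    q = torch.cat([v_x, p_x], dim=1).view(B, -1, H * W)  # source: (B, C+136, HW)
    k = torch.cat([v_y, p_y], dim=1).view(B, -1, H * W)  # reference
    logits = torch.bmm(q.transpose(1, 2), k)             # (B, HW, HW) dot products
    return F.softmax(logits, dim=-1)
```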

Fig. 2(C) gives an example.