Paper Reading Summary (1): Everybody Dance Now

Motion Transfer


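Everybody Dance Now is a paper that performs motion transfer based on pose keypoint estimation. The core idea is to train a separate generator for each target person; given an input pose stick figure, that generator synthesizes video frames of the real person. Another highlight is global pose normalization.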




The input is two videos: one of a target person whose appearance we wish to synthesize, and the other of a source subject whose motion we wish to impose onto our target person.


We add two components to improve the quality of our results. To encourage the temporal smoothness of our generated videos, we condition the prediction at each frame on that of the previous time step. To increase facial realism in our results, we include a specialized GAN trained to generate the target person's face.
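The temporal conditioning described above can be illustrated with a small loop. This is only a sketch: `synthesize_video`, `generator`, and `blank_frame` are hypothetical names, frames are stood in for by lists of floats, and starting from a blank placeholder frame is an assumption made here so the first step has something to condition on.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # stand-in for an image; a real system would use tensors


def synthesize_video(
    poses: Sequence[Frame],
    generator: Callable[[Frame, Frame], Frame],
    blank_frame: Frame,
) -> List[Frame]:
    """Generate frames one by one, feeding each output into the next step."""
    outputs: List[Frame] = []
    previous = blank_frame  # assumed placeholder: no frame exists before the first
    for pose in poses:
        # Each prediction sees the current pose and the previous prediction.
        frame = generator(pose, previous)
        outputs.append(frame)
        previous = frame
    return outputs


# Toy generator: mixes the pose with half of the previous output.
def toy_generator(pose: Frame, previous: Frame) -> Frame:
    return [p + 0.5 * q for p, q in zip(pose, previous)]


frames = synthesize_video([[1.0], [1.0], [1.0]], toy_generator, [0.0])
```

With the toy generator, the influence of earlier outputs is visible in how the values change even though the input pose stays constant.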


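We divide our pipeline into three stages: pose detection, global pose normalization, and mapping from normalized pose stick figures to the target subject.
1): (x, y) is a corresponding pair, where x is the stick figure drawn by connecting the keypoints extracted from y, and y itself is a frame taken from the dataset.
2): P is an existing pose detector, for example a model such as OpenPose; a pretrained model is used directly to produce the stick figures.
3): G generates an image matching y from the stick figure x. The goal of D is to distinguish (x, y) from (x, G(x)), judging (x, y) as real and the other pair as fake. The difference between G(x) and y is additionally measured with a pretrained VGG, by comparing their features.
4): Now suppose we are given a new person in a frame y'. We first use P to produce the stick-figure pose, then apply normalization, and feed the result to G to generate the new image G(x). The person in the output is the person from y, but with the pose from y', which achieves motion transfer.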
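The training signal described in step 3 can be written compactly. The following is a sketch in the standard conditional-GAN form, not the paper's full objective, which also contains terms for the temporal smoothing and the face GAN mentioned earlier. The symbols \phi_k (features of the k-th VGG layer) and \lambda_P (weight of the perceptual term), as well as the choice of an L1 distance, are notation assumed here.

```latex
\mathcal{L}_{\mathrm{GAN}}(G, D) =
  \mathbb{E}_{(x, y)}\left[\log D(x, y)\right]
  + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right]

\mathcal{L}_{P}(G(x), y) =
  \sum_{k} \left\lVert \phi_k(G(x)) - \phi_k(y) \right\rVert_1

\min_{G} \left[ \max_{D} \, \mathcal{L}_{\mathrm{GAN}}(G, D)
  + \lambda_P \, \mathcal{L}_{P}(G(x), y) \right]
```

The first term rewards D for labelling (x, y) as real and (x, G(x)) as fake, while the second pulls G(x) towards y in VGG feature space.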

(4): Global Pose Normalization Details

When transferring motion between two subjects, it may be necessary to transform the pose keypoints of the source person so that they appear in accordance with the target person’s body shape and proportion as in the Transfer section of Figure 3.
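Reason: differences between subjects in position within the frame, in body proportions, and in camera position and distance can all affect the generated result, which motivates normalizing the pose. The details are given in Section 9.1 of the paper. The transformation is computed from the source and the target together, not from a single person alone.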
To find a transformation in terms of scale and translation between a source pose and a target pose, we find the minimum and maximum ankle positions in image coordinates of each subject while they are on the ground (i.e. feet raised in the air are not considered). These coordinates represent the farthest and closest distances to the camera respectively. The maximum ankle position is the foot coordinate closest to the bottom of the image.
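One way to turn the ankle ranges into a per-frame scale and translation is linear interpolation between the far and close extremes. The sketch below is an illustration under stated assumptions, not the paper's exact formula: `SubjectStats`, `relative_depth`, and `normalize_pose` are hypothetical names, the height statistics used for the scale are assumed to be measured separately, and scaling x about the mean keypoint x is a choice made here for simplicity.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates, y grows downward


@dataclass
class SubjectStats:
    """Per-subject statistics over a whole video (hypothetical container)."""
    ankle_min: float     # smallest grounded ankle y: farthest from the camera
    ankle_max: float     # largest grounded ankle y: closest to the camera
    height_far: float    # assumed: subject height in pixels near ankle_min
    height_close: float  # assumed: subject height in pixels near ankle_max


def relative_depth(ankle_y: float, stats: SubjectStats) -> float:
    """Map an ankle position to [0, 1]: 0 is farthest, 1 is closest."""
    span = stats.ankle_max - stats.ankle_min
    if span == 0:
        return 0.0
    return min(1.0, max(0.0, (ankle_y - stats.ankle_min) / span))


def normalize_pose(
    keypoints: List[Point],
    ankle_y: float,
    source: SubjectStats,
    target: SubjectStats,
) -> List[Point]:
    """Rescale and shift source keypoints into the target's frame layout."""
    w = relative_depth(ankle_y, source)
    # Translation: place the ankle at the same relative depth in the target range.
    target_ankle_y = target.ankle_min + w * (target.ankle_max - target.ankle_min)
    # Scale: interpolate between the far and close height ratios.
    scale_far = target.height_far / source.height_far
    scale_close = target.height_close / source.height_close
    scale = scale_far + w * (scale_close - scale_far)
    # Scale x about the mean keypoint x, and y about the ankle.
    center_x = sum(x for x, _ in keypoints) / len(keypoints)
    return [
        (center_x + (x - center_x) * scale, target_ankle_y + (y - ankle_y) * scale)
        for x, y in keypoints
    ]


source = SubjectStats(ankle_min=100.0, ankle_max=200.0, height_far=50.0, height_close=100.0)
target = SubjectStats(ankle_min=200.0, ankle_max=400.0, height_far=100.0, height_close=200.0)
normalized = normalize_pose([(10.0, 150.0), (10.0, 50.0)], 150.0, source, target)
```

Here the source ankle sits halfway through its range, so it lands halfway through the target range, and the figure is doubled in size because the target appears twice as tall at both extremes.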

Loss function


