Paper Reading (CVPR 2020): Combining Detection and Tracking for Human Pose Estimation in Videos

 


 

1. Background

Multi-person human pose estimation and tracking in videos.

 

2. Problems and Difficulties

· Top-down methods do not perform as well on videos and were recently outperformed by a bottom-up approach.

· Detecting person bounding boxes in videos is a much harder task than in images.

· Videos inherently contain atypical types of occlusion, viewpoints, motion blur and poses that make object detectors occasionally fail.

 

3. Existing Solutions

Top-down approaches are limited by the performance of their person detector, which occasionally mis-detects or misses people in individual frames.

 

4. Main Contents of the Article

This is a novel top-down approach that tackles the problem of multi-person human pose estimation and tracking in videos.

Main idea: propagate known person locations forward and backward in time and search for poses in those regions. We detect person bounding boxes on each frame and then propagate them to the neighbouring frames. If a person is present at a specific location in a frame, they should still be at approximately that location in the neighbouring frames, even when the detector fails to find them.
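A minimal sketch of this propagation idea (the helper below is hypothetical, not the authors' code): boxes found on one frame are simply reused as search regions on every frame of a short clip.

```python
import numpy as np

def propagate_boxes(detected_boxes, num_frames):
    """detected_boxes: (N, 4) array of [x1, y1, x2, y2] person detections
    from one frame.

    Returns a (num_frames, N, 4) array holding the same N boxes on every
    frame of the clip, i.e. the regions in which poses will be searched
    even where the detector itself would have failed."""
    boxes = np.asarray(detected_boxes, dtype=np.float32)
    return np.repeat(boxes[None, :, :], num_frames, axis=0)
```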

 

Innovations:

  1. a Clip Tracking Network that performs body joint detection and tracking simultaneously on small video clips;
  2. a Video Tracking Pipeline that merges the fixed-length tracklets produced by the Clip Tracking Network into arbitrary-length tracks;
  3. a Spatial-Temporal Merging procedure that refines the joint locations based on spatial and temporal smoothing terms;
  4. clustering: in the Spatial-Temporal Merging procedure, all joint hypotheses are clustered, and a spatial-temporal optimization problem is solved on these clusters to estimate the best location of each joint.

 

Procedure: The Clip Tracking Network operates on fixed-length video clips and produces multi-person pose tracklets. We combine these tracklets into pose tracks for arbitrary-length videos in our Video Tracking Pipeline, by first generating temporally overlapping tracklets and then associating and merging the pose detections in frames where the tracklets overlap. When merging tracklets into tracks, we use the multiple pose detections in each frame in a novel consensus-based spatial-temporal merging procedure to estimate the optimal location of each joint. This procedure favours hypotheses that are spatially close to each other and temporally smooth. This combination is able to correct mistakes on highly entangled people, leading to more accurate predictions.
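A high-level sketch of this procedure. All four component names are hypothetical placeholders for what the paper describes (the paper does not expose such an API), and the clip length is illustrative; passing the components in as callables keeps the sketch self-contained.

```python
def track_video(frames, detect_people, crop_tube, clip_tracking_net,
                merge_tracklets, clip_len=9, stride=1):
    tracklets = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        clip = frames[start:start + clip_len]
        keyframe = clip[clip_len // 2]
        for box in detect_people(keyframe):    # (a) detect on the keyframe only
            tube = crop_tube(clip, box)        # (b) one tube per detected person
            poses = clip_tracking_net(tube)    # (c) tracklet from the 3D HRNet
            tracklets.append((start, poses))
    # A stride smaller than clip_len makes consecutive tracklets overlap in
    # time, which is what the merging step exploits.
    return merge_tracklets(tracklets)          # (d) tracklets -> full tracks
```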

 

Key Methodology:

3D HRNet: We use a 3D HRNet as the Clip Tracking Network for video pose estimation and tracking. The output of HRNet is a set of heatmaps, one for each body joint; each pixel of a heatmap indicates the likelihood of 'containing' that joint.
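As a concrete illustration, the standard way to read a joint location out of such a heatmap is an argmax over its pixels (a minimal sketch; the paper's exact post-processing may differ):

```python
import numpy as np

def decode_heatmap(heatmap):
    """heatmap: (H, W) array of per-pixel joint likelihoods.

    Returns the (x, y) location of the peak and its confidence score."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (int(x), int(y)), float(heatmap[y, x])
```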

Clip Tracking Network: First, (a) our approach runs a person detector on the keyframe of a short video clip. Then, (b) for each detected person it creates a tube by cropping the region within that person's bounding box from all the frames in the clip. Next, (c) each tube is independently fed into our Clip Tracking Network (3D HRNet), which outputs pose estimates for the same person (the one originally detected in the keyframe) in all the frames of the tube. Finally, (d) we reproject the predicted poses onto the original images to show how the model can correctly predict poses in all the frames of the clip by only detecting people in the keyframe.
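A minimal sketch of step (b), tube creation, assuming the clip is stacked into a (T, H, W, 3) array. The 25% box enlargement follows the paper (see section 6 below); the helper itself is hypothetical.

```python
import numpy as np

def crop_tube(frames, box, scale=1.25):
    """frames: (T, H, W, 3) array; box: (x1, y1, x2, y2) keyframe detection.

    Enlarges the box by 25% along both dimensions (scale=1.25) and crops the
    same window from every frame, yielding one tube for the detected person."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    x1 = int(max(cx - w / 2, 0))
    x2 = int(min(cx + w / 2, frames.shape[2]))
    y1 = int(max(cy - h / 2, 0))
    y2 = int(min(cy + h / 2, frames.shape[1]))
    return frames[:, y1:y2, x1:x2, :]
```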


The Video Tracking Pipeline merges fixed-length tracklets into arbitrary-length tracks by comparing the similarity of their detected poses in the frames where the tracklets overlap.
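One plausible way to implement this association (a sketch, not the paper's exact similarity metric): score pose pairs by mean joint distance in an overlapping frame and solve the resulting assignment problem with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_poses(poses_a, poses_b):
    """poses_a: (N, K, 2) and poses_b: (M, K, 2) arrays of K-joint poses
    detected by two sets of tracklets in the same (overlapping) frame.

    Returns (i, j) pairs matching tracklets in a to tracklets in b."""
    cost = np.zeros((len(poses_a), len(poses_b)))
    for i, pa in enumerate(poses_a):
        for j, pb in enumerate(poses_b):
            # mean Euclidean distance between corresponding joints
            cost[i, j] = np.linalg.norm(pa - pb, axis=1).mean()
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))
```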

 


 

Merging pose hypotheses: Our Video Tracking Pipeline runs the Clip Tracking Network on multiple temporally overlapping clips, producing multiple hypotheses for every joint of a person (a). We cluster these hypotheses (b) and solve a spatial-temporal optimization problem on these clusters to estimate the best location of each joint (c). This achieves better predictions than a simple baseline that always picks the hypothesis with the highest confidence score (d), especially on frames with highly entangled people.
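A simplified sketch of this consensus step for a single joint in a single frame. The paper solves a full spatial-temporal optimization; here only the spatial agreement term is mimicked, and the clustering threshold is a hypothetical parameter.

```python
import numpy as np

def merge_joint_hypotheses(points, scores, radius=10.0):
    """points: (N, 2) hypothesised locations for one joint in one frame;
    scores: (N,) confidences; radius: hypothetical clustering threshold.

    Keeps the hypotheses that agree spatially with the most confident one
    and returns their confidence-weighted mean location."""
    points = np.asarray(points, dtype=np.float32)
    scores = np.asarray(scores, dtype=np.float32)
    best = int(np.argmax(scores))
    close = np.linalg.norm(points - points[best], axis=1) <= radius
    weights = scores[close] / scores[close].sum()
    return (points[close] * weights[:, None]).sum(axis=0)
```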


5. Results Comparison

Our approach achieves state-of-the-art (SOTA) results on both metrics, on both datasets, and against both top-down and bottom-up approaches. In some cases the improvement over the SOTA is substantial: +6.5 mAP on PoseTrack2017 (which corresponds to a 28% error reduction) and +3.0 MOTA on PoseTrack2018 (a 9% error reduction).

When compared to only top-down approaches, which is the category this approach belongs to, the improvement in MOTA is even more significant: up to +6.2 on PoseTrack2017 (an 18% error reduction) over the winner of the last PoseTrack challenge (FlowTrack, 65.4 vs 71.6), showing the importance of performing joint detection and tracking simultaneously.

Note: joint detection performance is expressed in terms of average precision (AP), while tracking performance is expressed in terms of multi-object tracking accuracy (MOTA).
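For reference, MOTA follows the standard multi-object tracking definition, MOTA = 1 - sum_t(FN_t + FP_t + IDSW_t) / sum_t(GT_t), which penalizes misses (FN), false positives (FP) and identity switches (IDSW) relative to the total number of ground-truth annotations (GT) over all frames t.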

 

6. Personal Thoughts

To address occlusion, entangled people, and missed person detections, this paper uses person locations already established in neighbouring frames to find the same person in the current frame; each bounding box is enlarged by 25% along both dimensions before the tube is created, which handles these failure cases well. To associate the same person across frames, the 2D convolutions in the first two stages of HRNet are inflated to 3D, helping the network learn to track. For the final estimate of each body joint, rather than simply taking the point with the highest confidence, the hypotheses are clustered with different weights to obtain an optimal point, which further improves the method's accuracy.
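As an illustration of the 2D-to-3D inflation mentioned above, the common I3D-style recipe repeats pretrained 2D weights along a new temporal axis and rescales them; a sketch in PyTorch follows (the paper's exact initialisation may differ).

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    """Inflate a pretrained 2D convolution to 3D (I3D-style sketch).

    Assumes groups=1 and integer padding. Weights are repeated time_kernel
    times along the temporal axis and divided by time_kernel, so a static
    clip produces the same response as the original 2D layer."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        weight = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)
        conv3d.weight.copy_(weight / time_kernel)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```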