VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry

This post is just my personal reading notes; apologies for any mistakes, and corrections are welcome.
For more related posts, see: http://blog.****.net/weixin_39779106
If you repost this, please include a link to this article: http://blog.****.net/weixin_39779106/article/details/79689208

Original paper link
Project website and demo link

1. Abstract


Key excerpts from the abstract:
"In this paper, we propose the novel VLocNet++ architecture that attempts to overcome this limitation by simultaneously embedding geometric and semantic knowledge of the world into the pose regression network."
"We adopt a multitask learning approach that exploits the inter-task relationship between learning semantics, regressing 6-DoF global pose and odometry, for the mutual benefit of each of these tasks. VLocNet++ incorporates the Geometric Consistency Loss function that utilizes the predicted motion from the odometry stream to enforce global consistency during pose regression."
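The exact formulation isn't given in this excerpt, so here is my own minimal PyTorch-style sketch of what such a consistency term could look like: the relative motion implied by two consecutive global pose predictions is compared against the motion predicted by the odometry stream. The pose parameterization (translation plus unit quaternion in (w, x, y, z) order) and all names below are my assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def quat_conj(q):
    # Conjugate of a unit quaternion (w, x, y, z) -> (w, -x, -y, -z)
    return torch.cat([q[..., :1], -q[..., 1:]], dim=-1)

def quat_mul(a, b):
    # Hamilton product of two quaternions in (w, x, y, z) order
    aw, ax, ay, az = a.unbind(-1)
    bw, bx, by, bz = b.unbind(-1)
    return torch.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], dim=-1)

def rotate(q, v):
    # Rotate vector v by unit quaternion q via q * (0, v) * q^-1
    qv = torch.cat([torch.zeros_like(v[..., :1]), v], dim=-1)
    return quat_mul(quat_mul(q, qv), quat_conj(q))[..., 1:]

def geometric_consistency_loss(x_t, q_t, x_prev, q_prev, x_odo, q_odo):
    # Relative motion implied by two consecutive global pose predictions,
    # expressed in the previous camera frame ...
    x_rel = rotate(quat_conj(q_prev), x_t - x_prev)
    q_rel = quat_mul(quat_conj(q_prev), q_t)
    # ... should agree with the relative motion from the odometry stream
    return F.l1_loss(x_rel, x_odo) + F.l1_loss(q_rel, q_odo)
```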
"Furthermore, we propose a self-supervised warping technique that uses the relative motion to warp intermediate network representations in the segmentation stream for learning consistent semantics."
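The excerpt doesn't spell out how the warping itself is done. A common way to realize this kind of feature warping is inverse warping with a sampling grid; below is a minimal sketch along those lines, assuming per-pixel depth and camera intrinsics are available at the feature resolution (my assumption; the paper's actual implementation may differ).

```python
import torch
import torch.nn.functional as F

def warp_features(feat_prev, depth_t, K, T_rel):
    """Inverse-warp previous-frame features into the current frame.

    feat_prev : (B, C, H, W) intermediate features from timestep t-1
    depth_t   : (B, 1, H, W) per-pixel depth at timestep t (assumed available)
    K         : (B, 3, 3) camera intrinsics at the feature resolution
    T_rel     : (B, 4, 4) transform taking points from frame t to frame t-1
    """
    B, _, H, W = depth_t.shape
    dev = depth_t.device
    # Homogeneous pixel grid of the current frame
    ys, xs = torch.meshgrid(
        torch.arange(H, device=dev, dtype=torch.float32),
        torch.arange(W, device=dev, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)        # (3, H, W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                     # (B, 3, H*W)
    # Back-project to 3-D points in the current camera frame
    cam = torch.linalg.inv(K) @ pix * depth_t.view(B, 1, -1)
    cam = torch.cat([cam, torch.ones(B, 1, H * W, device=dev)], 1)  # homogeneous
    # Transform into the previous camera frame and project
    cam_prev = (T_rel @ cam)[:, :3]
    uv = K @ cam_prev
    uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1] for grid_sample
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(feat_prev, grid, align_corners=True)
```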
"In addition, we propose a novel adaptive weighted fusion layer to leverage inter and intra task dependencies based on region activations."
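As a rough illustration of the idea (my own minimal version, not the paper's exact layer): predict per-pixel fusion weights from the concatenated inputs, so the mixture varies with the activations in each region rather than being one fixed global weight.

```python
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    """Sketch of region-adaptive fusion: a 1x1 conv predicts a per-pixel
    softmax weight for each input map, and the maps are blended with
    those weights."""

    def __init__(self, channels, num_inputs=2):
        super().__init__()
        self.weight_conv = nn.Conv2d(channels * num_inputs, num_inputs, kernel_size=1)

    def forward(self, feats):
        # feats: list of (B, C, H, W) maps from different tasks/timesteps
        w = torch.softmax(self.weight_conv(torch.cat(feats, dim=1)), dim=1)
        # w[:, i] is the per-pixel weight for the i-th input map
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
```

Because the weights are predicted per pixel, a region where one input has weak activations can lean on the other input, which matches the "region activations" phrasing in the abstract.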
"Finally, we introduce a first-of-a-kind urban outdoor localization dataset with pixel-level semantic labels and multiple loops for training deep networks."

2. Introduction


Key excerpts from the introduction:
"In this work, we focus on jointly learning three diverse vital tasks that are crucial for robot autonomy, namely, semantic segmentation, visual localization and odometry, from consecutive monocular images. This problem is extremely challenging as it involves simultaneously learning cross-domain tasks that perform pixelwise classification and regression with different units and scales."
"However, these tasks inherently share complex interdependencies that we aim to exploit using our MTL framework, thereby enabling inter-task learning which improves generalization capabilities and alleviates the problem of requiring vast amounts of labeled training data, that is especially hard to obtain in the robotics domain."
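Jointly learning tasks "with different units and scales" raises the question of how to balance the per-task losses. One standard recipe for this (the homoscedastic-uncertainty weighting of Kendall et al.; I'm not claiming it is exactly what VLocNet++ uses) looks like the following:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Learnable task weighting: each task loss L_i is scaled by
    exp(-s_i) plus a regularizer s_i, where s_i = log(sigma_i^2)
    is learned jointly with the network."""

    def __init__(self, num_tasks=3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        # task_losses: iterable of scalar losses, e.g.
        # [segmentation_ce, global_pose_loss, odometry_loss]
        total = 0.0
        for s, loss in zip(self.log_vars, task_losses):
            total = total + torch.exp(-s) * loss + s
        return total
```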
"However, one of the critical shortfalls [of the earlier VLocNet] is that it does not allow the network to utilize the learned motion specific features from the previous timestep. To address this limitation, we build upon the VLocNet model and employ an adaptive weighting approach to aggregate motion-specific temporal information for improving the pose prediction accuracy."
"Our motivation for jointly estimating semantics is based on the premise that it can instill structural cues about the environment into the pose regression network and implicitly pull the attention towards more informative regions in the scene. A popular paradigm employed for semantics-aware localization is to extract predefined features, emphasize on stable features [9] or combine them with local features [10]. Although these handcrafted solutions have demonstrated considerable reliability, their performance suffers substantially when the predefined structures are occluded or not visible in the scene. In contrast, our proposed adaptive fusion layer is able to fuse relevant features not only based on the semantic category but also the activations in the region."
Key excerpts on the contributions:
"we propose a novel self-supervised semantic aggregation technique leveraging the predicted motion from the odometry stream of our network."
"we fuse intermediate network representations from the previous timestep into the current frame using our proposed adaptive weighted fusion layer. This enables our semantic segmentation network to aggregate more scene-level global context, thereby improving the performance and leading to faster convergence."
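Putting the earlier sketches together, the temporal aggregation described here could look roughly like the following glue code (again purely illustrative, reusing the hypothetical warp_features and AdaptiveWeightedFusion from above):

```python
# feat_t, feat_prev: intermediate segmentation features at times t and t-1
# T_rel comes from the odometry stream; depth_t and K are assumed available
feat_prev_warped = warp_features(feat_prev, depth_t, K, T_rel)
fusion = AdaptiveWeightedFusion(channels=feat_t.shape[1])
feat_fused = fusion([feat_t, feat_prev_warped])  # aggregated scene-level context
```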

My note 1: This section summarizes the paper's contributions: a multitask fusion network, a pose-regression CNN architecture with state-of-the-art performance, a self-supervised aggregation technique, an adaptive weighting layer, and the first large-scale outdoor dataset with multiple loops and pixel-level semantic annotations for training deep networks.
My note 2: Compared with the authors' previous work (Deep Auxiliary Learning for Visual Localization and Odometry, i.e., VLocNet), the problem identified there is that the network could not exploit learned features from previous timesteps; this work improves on that.