Study Notes: Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning

Abstract

Two less addressed issues of deep reinforcement learning are (1) lack of generalization capability to new target goals, and (2) data inefficiency, i.e., the model requires several (and often costly) episodes of trial and error to converge, which makes it impractical to apply to real-world scenarios. In this paper, we address these two issues and apply our model to the task of target-driven visual navigation. To address the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows it to generalize better. To address the second issue, we propose the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine. Our framework enables agents to take actions and interact with objects. Hence, we can collect a huge number of training samples efficiently. We show that our proposed method (1) converges faster than the state-of-the-art deep reinforcement learning methods, (2) generalizes across targets and across scenes, (3) generalizes to a real robot scenario with a small amount of fine-tuning (although the model is trained in simulation), and (4) is end-to-end trainable and does not need feature engineering, feature matching between frames, or 3D reconstruction of the environment.


III. AI2-THOR FRAMEWORK

To train and evaluate our model, we require a framework for performing actions and perceiving their outcomes in a 3D environment. Integrating our model with different types of environments is a main requirement for generalization of our model. Hence, the framework should have a plug-n-play architecture such that different types of scenes can be easily incorporated. Additionally, the framework should have a detailed model of the physics of the scene so the movements and object interactions are properly represented.


For this purpose, we propose The House Of inteRactions (AI2-THOR) framework, which is designed by integrating a physics engine (Unity 3D) with a deep learning framework (Tensorflow [44]). The general idea is that the rendered images of the physics engine are streamed to the deep learning framework, and the deep learning framework issues a control command based on the visual input and sends it back to the agent in the physics engine. Similar frameworks have been proposed by [36], [37], [41], [39], [38], but the main advantages of our framework are as follows: (1) The physics engine and the deep learning framework directly communicate (in contrast to separating the physics engine from the controller as in [35]). Direct communication is important since the feedback from the environment can be immediately used for online decision making. (2) We tried to mimic the appearance distribution of the real-world images as closely as possible. For example, [36] works on Atari games, which are 2D environments that are limited in terms of appearance, and [40] is a collection of synthetic scenes that are non-photo-realistic and do not follow the distribution of real-world scenes in terms of lighting, object appearance, textures, background clutter, etc. This is important for enabling us to generalize to real-world images.


To create indoor scenes for our framework, we provided reference images to artists to create a 3D scene with the texture and lighting similar to the image. So far we have 32 scenes that belong to 4 common scene types in a household environment: kitchen, living room, bedroom, and bathroom. On average, each scene contains 68 object instances.


The advantage of using a physics engine for modeling the world is that it is highly scalable (training a robot in real houses is not easily scalable). Furthermore, training the models can be performed cheaper and safer (e.g., the actions of the robot might damage objects). One main drawback of using synthetic scenes is that the details of the real world are under-modeled. However, recent advances in the graphics community make it possible to have a rich representation of the real-world appearance and physics, narrowing the discrepancy between real world and simulation. Fig. 2 provides a qualitative comparison between a scene in our framework and example scenes in other frameworks and datasets. As shown, our scenes better mimic the appearance properties of real world scenes. In this work, we focus on navigation, but the framework can be used for more fine-grained physical interactions, such as applying a force, grasping, or object manipulations such as opening and closing a microwave. Fig. 3 shows a few examples of high-level interactions. We will provide Python APIs with our framework for an AI agent to interact with the 3D scenes.


IV. TARGET-DRIVEN NAVIGATION MODEL

In this section, we first define our formulation for target-driven visual navigation. Then we describe our deep siamese actor-critic network for this task.


 


Siamese network

A siamese network is a "twin" neural network: the twinning is achieved by sharing weights between two branches, as shown in the figure below.

[Figure: two branches of a siamese network sharing the same weights]

Weight sharing means the two branches have exactly the same weights; in an implementation they can even be a single network instance, so there is no need to build a second copy. The two branches of a siamese network can be LSTMs, CNNs, or other architectures.

If the two branches do not share weights and are two different networks, the model is called a pseudo-siamese network, shown below. In a pseudo-siamese network the two branches can be different architectures (e.g., one LSTM and one CNN) or two networks of the same type.

[Figure: a pseudo-siamese network with two separate, non-weight-sharing branches]

What is a siamese network used for?

In short, it measures how similar two inputs are. A siamese network takes two inputs (Input1 and Input2), feeds them into two networks (Network1 and Network2) that map each input into a new embedding space, and then evaluates the similarity of the two inputs through a loss computed on the two embeddings.

With the rise of SVMs and related algorithms, neural networks were largely set aside for a while, although some researchers kept working on them (e.g., Hinton's 2010 ICML paper "Rectified Linear Units Improve Restricted Boltzmann Machines"). A classic application of siamese networks is face verification: two face images are fed into a shared convolutional network, which outputs whether they show the same person or different people.

A siamese network suits the case where the two inputs are fairly similar in kind; a pseudo-siamese network suits inputs that differ substantially. For example, to measure the semantic similarity of two sentences or words, a siamese network is a good fit; to check whether a headline is consistent with the body text (which differ greatly in length), or whether a piece of text describes an image (one input is an image, the other text), a pseudo-siamese network is more appropriate. In short, the choice of structure and loss depends on the application.

Which loss function is typically used for a siamese network?

Softmax is a reasonable choice, but not necessarily the best one, even for classification problems. The traditional siamese network uses a contrastive loss, and there are many other options. Since the purpose of a siamese network is to measure the similarity of two inputs, each branch maps its input to a vector in a new space, and similarity can then be judged from a distance in that space: cosine distance is one choice, an exp-based similarity is another, and Euclidean distance also works. The training objective is to make the distance between similar inputs as small as possible and the distance between inputs of different classes as large as possible. A brief note on the difference between cosine and exp-based similarity in NLP:

Empirically, cosine similarity is better suited to word-level semantic similarity, while an exp-based similarity works better at the sentence and paragraph level. A possible reason is that cosine only measures the angle between two vectors, whereas an exp-based measure also preserves the vectors' length information, and sentences carry more information (this has not been verified experimentally).

In our own paper we used an exp-based distance for multi-class classification, to measure whether a headline and the body text take the same stance in the Fake News Challenge. [1]
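To make the weight sharing and the distance-based loss concrete, here is a minimal TensorFlow/Keras sketch of a siamese network trained with a contrastive loss. The encoder layout, the 128-d inputs, and the margin of 1.0 are illustrative assumptions, not settings taken from any paper discussed here.

```python
import tensorflow as tf

# Minimal siamese-network sketch with a contrastive loss.
# Encoder width, input dimensionality, and margin are illustrative assumptions.

def make_encoder(input_dim=128, embed_dim=32):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dense(embed_dim),
    ])

encoder = make_encoder()              # one network; both branches share its weights
x1 = tf.keras.Input(shape=(128,))
x2 = tf.keras.Input(shape=(128,))
e1, e2 = encoder(x1), encoder(x2)     # the same encoder maps both inputs

# Euclidean distance between the two embeddings.
distance = tf.keras.layers.Lambda(
    lambda t: tf.norm(t[0] - t[1], axis=1, keepdims=True))([e1, e2])
siamese = tf.keras.Model(inputs=[x1, x2], outputs=distance)

def contrastive_loss(y_true, d, margin=1.0):
    # y_true = 1 for similar pairs, 0 for dissimilar pairs.
    y_true = tf.cast(y_true, d.dtype)
    return tf.reduce_mean(y_true * tf.square(d) +
                          (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

siamese.compile(optimizer="adam", loss=contrastive_loss)
```

Replacing the Euclidean distance with a cosine or exp-based similarity, as discussed above, only changes the distance computation and the corresponding loss.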


 

A. Problem Statement

Our goal is to find the minimum length sequence of actions that move an agent from its current location to a target that is specified by an RGB image. We develop a deep reinforcement learning model that takes as input an RGB image of the current observation and another RGB image of the target. The output of the model is an action in 3D such as move forward or turn right. Note that the model learns a mapping from the 2D image to an action in the 3D space.


B. Problem Formulation

Vision-based robot navigation requires a mapping from sensory signals to motion commands. Previous work on reinforcement learning typically does not consider high-dimensional perceptual inputs [45]. Recent deep reinforcement learning (DRL) models [2] provide an end-to-end learning framework for transforming pixel information into actions. However, DRL has largely focused on learning goal-specific models that tackle individual tasks in isolation. This training setup is rather inflexible to changes in task goals. For instance, as pointed out by Lake et al. [46], changing the rule of the game would have a devastating performance impact on DRL-based Go-playing systems [30]. Such a limitation is rooted in the fact that standard DRL models [2], [3] aim at finding a direct mapping (represented by a deep neural network π) from state representations s to policy π(s). In such cases, the goal is hardcoded in the neural network parameters. Thus, changes in goals would require updating the network parameters accordingly.



AlphaGo

[30] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484-489.

The "DRL-based Go-playing systems" referred to above are AlphaGo.


Such a limitation is especially problematic for mobile robot navigation. When applying DRL to multiple navigation targets, the network has to be re-trained for each target. In practice, it is prohibitive to exhaust every target in a scene. This is the problem caused by a lack of generalization, i.e., we would have to re-train a new model when incorporating new targets. Therefore, it is preferable to have a single navigation model which learns to navigate to new targets without re-training. To achieve this, we specify the task objective (i.e., the navigation destination) as an input to the model, instead of implanting the target in the model parameters. We refer to this problem as target-driven visual navigation. Formally, the learning objective of a target-driven model is to learn a stochastic policy function π which takes two inputs, a representation of the current state s_t and a representation of the target g, and produces a probability distribution over the action space, π(s_t, g). For testing, a mobile robot keeps taking actions drawn from the policy distribution until reaching the destination. This way, actions are conditioned on both states and targets. Hence, no re-training for new targets is required.


 

C. Learning Setup

Before introducing our model, we first describe the key ingredients of the reinforcement learning setup: action space, observations and goals, and reward design.


1) Action space: Real-world mobile robots have to deal with low-level mechanics. However, such mechanical details make the learning significantly more challenging. A common approach is to learn at a certain level of abstraction, where the underlying physics is handled by a lower-level controller (e.g., a 3D physics engine). We train our model with command-level actions. For our visual navigation tasks, we consider four actions: moving forward, moving backward, turning left, and turning right. We use a constant step length (0.5 meters) and turning angle (90 degrees). This essentially discretizes the scene space into a grid-world representation. To model uncertainty in real-world system dynamics, we add Gaussian noise to steps, N(0, 0.01), and to turns, N(0, 1.0), at each location.

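A rough sketch of this discretized, noisy action model follows. The step length, turning angle, and the N(0, 0.01) / N(0, 1.0) noise terms come from the paragraph above (interpreting the second argument as a variance); the pose-update code itself is an assumption made for illustration.

```python
import math
import random

STEP_LENGTH = 0.5      # meters, from the paper
TURN_ANGLE = 90.0      # degrees, from the paper
STEP_NOISE_STD = 0.1   # N(0, 0.01) read as variance 0.01 -> std 0.1 (assumption)
TURN_NOISE_STD = 1.0   # N(0, 1.0)  read as variance 1.0  -> std 1.0 (assumption)

ACTIONS = ("move_forward", "move_backward", "turn_left", "turn_right")

def apply_action(x, y, heading_deg, action):
    """Update an (x, y, heading) pose for one command-level action."""
    if action in ("move_forward", "move_backward"):
        step = STEP_LENGTH + random.gauss(0.0, STEP_NOISE_STD)
        sign = 1.0 if action == "move_forward" else -1.0
        x += sign * step * math.cos(math.radians(heading_deg))
        y += sign * step * math.sin(math.radians(heading_deg))
    else:
        turn = TURN_ANGLE + random.gauss(0.0, TURN_NOISE_STD)
        heading_deg += turn if action == "turn_left" else -turn
    return x, y, heading_deg % 360.0
```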

2) Observations and Goals: Both observations and goals are images taken by the agent's RGB camera in its first-person view. The benefit of using images as goal descriptions is the flexibility for specifying new targets. Given a target image, the task objective is to navigate to the location and viewpoint where the target image is taken.


3) Reward design: We focus on minimizing the trajectory length to the navigation targets. Other factors such as energy efficiency could be considered instead. Therefore, we only provide a goal-reaching reward (10.0) upon task completion. To encourage shorter trajectories, we add a small time penalty (-0.01) as immediate reward.

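The reward design boils down to two constants; a minimal sketch, assuming the surrounding episode loop knows when the target has been reached:

```python
GOAL_REWARD = 10.0    # given only upon reaching the navigation target
STEP_PENALTY = -0.01  # immediate reward at every other step, to favor short trajectories

def step_reward(reached_target: bool) -> float:
    return GOAL_REWARD if reached_target else STEP_PENALTY

# An episode that reaches the goal after n steps collects an undiscounted
# return of -0.01 * (n - 1) + 10.0, so shorter trajectories score higher.
```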

D. Model
We focus on learning the target-driven policy function π via deep reinforcement learning. We design a new deep neural network as a non-linear function approximator for π, where action a at time t can be drawn by:

    a ~ π(s_t, g | u)

where u denotes the model parameters, s_t is the image of the current observation, and g is the image of the navigation target. When the target g belongs to a finite discrete set, π can be seen as a mixture model, where g indexes the right set of parameters for each goal. However, the number of real-world goals is often countless (due to many different locations or highly variable object appearances). Thus, it is preferable to learn a projection that transforms the goals into an embedding space. Such a projection enables knowledge transfer across this embedding space, and therefore allows the model to generalize to new targets.


Navigational decisions demand an understanding of the relative spatial positions between the current location and the target location, as well as a holistic sense of scene layout. We develop a new deep siamese actor-critic network to capture such intuitions. Fig. 4 illustrates our model for the target-driven navigation tasks. Overall, the inputs to the network are two images that represent the agent's current observation and the target. Our approach to reasoning about the spatial arrangement between the current location and the target is to project them into the same embedding space, where their geometric relations are preserved. Deep siamese networks are a type of two-stream neural network model for discriminative embedding learning [47]. We use two streams of weight-shared siamese layers to transform the current state and the target into the same embedding space. Information from both embeddings is fused to form a joint representation. This joint representation is passed through scene-specific layers (refer to Fig. 4). The intention of having scene-specific layers is to capture the special characteristics (e.g., room layouts and object arrangements) of a scene that are crucial for the navigation tasks. Finally, the model generates policy and value outputs similar to the advantage actor-critic models [3]. In this model, targets across all scenes share the same generic siamese layers, and all targets within a scene share the same scene-specific layer. This makes the model better generalize across targets and across scenes.


E. Training Protocol
Traditional RL models learn individual tasks in isolation, resulting in inflexibility with respect to goal changes. As our deep siamese actor-critic network shares parameters across different tasks, it can benefit from learning with multiple goals simultaneously. A3C [3] is a type of reinforcement learning model that learns by running multiple copies of training threads in parallel and updates a shared set of model parameters in an asynchronous manner. It has been shown that these parallel training threads stabilize each other, achieving state-of-the-art performance in the video game domain. We use a similar training protocol as A3C.


However, rather than running copies of a single game, each thread runs with a different navigation target. Thus, gradients are backpropagated from the actor-critic outputs back to the lower-level layers. The scene-specific layers are updated by gradients from the navigation tasks within the scene, and the generic siamese layers are updated by all targets.

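The following is a schematic sketch of that update pattern, not the authors' implementation: one worker per (scene, target) navigation task, a shared parameter store updated lock-free, and scene-specific parameters touched only by workers in that scene. The scene names, parameter sizes, and the placeholder rollout are assumptions.

```python
import threading
import numpy as np

# Generic siamese parameters are shared by every worker; each scene's branch
# is updated only by workers whose target lies in that scene.
shared = {
    "siamese": np.zeros(512),
    "scene_branch": {"kitchen_01": np.zeros(512), "bedroom_04": np.zeros(512)},
}

def worker(scene, target, n_updates=5, lr=7e-4):
    for _ in range(n_updates):
        # Placeholder for: roll out the policy towards `target`, then compute
        # actor-critic gradients for the generic and scene-specific parts.
        g_siamese = np.random.randn(512) * 1e-3
        g_scene = np.random.randn(512) * 1e-3
        # Asynchronous, lock-free (Hogwild-style) parameter update.
        shared["siamese"] -= lr * g_siamese
        shared["scene_branch"][scene] -= lr * g_scene

threads = [threading.Thread(target=worker, args=(scene, tgt))
           for scene, tgt in [("kitchen_01", "fridge"),
                              ("kitchen_01", "sink"),
                              ("bedroom_04", "bed")]]
for th in threads:
    th.start()
for th in threads:
    th.join()
```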

F. Network Architectures
The bottom part of the siamese layers consists of ImageNet-pretrained ResNet-50 [48] layers (with the softmax layer truncated) that produce 2048-d features from a 224 × 224 × 3 RGB image. We freeze these ResNet parameters during training. We stack 4 history frames as state inputs to account for the agent's previous motions. The output vectors from both streams are projected into the 512-d embedding space. The fusion layer takes a 1024-d concatenated embedding of the state and the target, generating a 512-d joint representation. This vector is passed through two fully-connected scene-specific layers, producing 4 policy outputs (i.e., a probability distribution over actions) and a single value output. We train this network with a shared RMSProp optimizer with a learning rate of 7 × 10^-4.

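A TensorFlow/Keras sketch of this architecture, using the dimensions quoted above (frozen 2048-d ResNet-50 features for 4 stacked frames, 512-d siamese embeddings, a 1024-d concatenation fused into a 512-d joint representation, and a scene-specific head with 4 policy outputs and one value output). The layer names, the 512-d width of the scene-specific FC layers, and the use of pre-extracted ResNet features as inputs are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(num_actions=4):
    # Pre-extracted, frozen ResNet-50 features for 4 stacked history frames.
    obs_feat = tf.keras.Input(shape=(4, 2048), name="observation_features")
    goal_feat = tf.keras.Input(shape=(4, 2048), name="target_features")

    # Generic siamese stream: shared weights project each input to 512-d.
    siamese = tf.keras.Sequential([
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
    ], name="generic_siamese")
    obs_embed = siamese(obs_feat)
    goal_embed = siamese(goal_feat)

    # Fusion: 1024-d concatenation -> 512-d joint representation.
    joint = layers.Dense(512, activation="relu", name="fusion")(
        layers.Concatenate()([obs_embed, goal_embed]))

    # Scene-specific branch (one per scene in the full model).
    h = layers.Dense(512, activation="relu", name="scene_fc1")(joint)
    h = layers.Dense(512, activation="relu", name="scene_fc2")(h)
    policy = layers.Dense(num_actions, activation="softmax", name="policy")(h)
    value = layers.Dense(1, name="value")(h)
    return tf.keras.Model([obs_feat, goal_feat], [policy, value])

model = build_model()
optimizer = tf.keras.optimizers.RMSprop(learning_rate=7e-4)  # shared RMSProp
```

In the full model there would be one such scene-specific head per scene, stacked on top of the shared generic layers.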

 


ImageNet

The ImageNet project is a large visual database for research in visual object recognition. More than 14 million image URLs have been manually annotated by ImageNet to indicate the objects in the pictures, and bounding boxes are additionally provided for at least one million images. ImageNet contains more than 20,000 categories; [2] a typical category, such as "balloon" or "strawberry", contains several hundred images. The annotation database of third-party image URLs is freely available directly from ImageNet, but the actual images do not belong to ImageNet.

Since 2010, the ImageNet project has held an annual software competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which programs compete to correctly classify and detect objects and scenes. The challenge uses a "trimmed" list of 1,000 non-overlapping classes. The breakthrough on the ImageNet challenge in 2012 is widely regarded as the beginning of the deep learning revolution of the 2010s.

ImageNet is organized like a network with many nodes, where each node corresponds to an item or subcategory. According to the official site, a node contains at least 500 training images of the corresponding object. It is, in effect, a huge image library for visual training.

ImageNet's structure is basically a pyramid: directory -> subdirectory -> image set.

The database was first presented as a poster by researchers from the Computer Science Department of Princeton University at the 2009 Conference on Computer Vision and Pattern Recognition (CVPR) in Florida. [2]

Is ImageNet pre-training necessary?

Not necessarily: with enough target data and computational resources, we may not need to rely on ImageNet pre-training. Experimental results show that ImageNet pre-training helps the model converge faster, but it does not necessarily improve final accuracy unless the dataset is very small (e.g., <10k COCO images). This suggests that, in future work, collecting annotations for the target data (rather than for pre-training data) is more helpful for improving performance on the target task.

Is ImageNet useful?

It certainly is. ImageNet pre-training has long been a performance aid for many computer vision tasks. It shortens training, makes promising results easier to obtain, and a pre-trained model can be reused many times at low training cost; pre-trained models also converge faster. ImageNet pre-training is still likely to help computer vision research. [3]

https://blog.csdn.net/lanran2/article/details/79057994

 

ResNet

ResNet (Residual Neural Network) was proposed by Kaiming He and three colleagues at Microsoft Research. Using residual units they successfully trained a 152-layer network and won ILSVRC 2015 with a top-5 error rate of 3.57%, while using fewer parameters than VGGNet. The residual structure greatly speeds up the training of deep networks and also improves accuracy noticeably, and ResNet generalizes very well; it can even be dropped directly into Inception-style networks.

The main idea of ResNet is to add shortcut connections to the network, in the spirit of Highway Networks. Earlier architectures apply a non-linear transformation to each layer's input, whereas a Highway Network allows a certain proportion of the previous layer's output to be preserved. ResNet similarly allows the original input to be passed directly to later layers, as shown below.

[Figure: a residual block, where the input is added to the block's output via a shortcut connection]

In this way a layer does not need to learn the entire output, only the residual with respect to the previous layer's output, which is why ResNet is called a residual network. [4]
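A minimal Keras sketch of a residual block, to illustrate the shortcut connection; the layer widths are arbitrary, and real ResNets use convolutional layers rather than the dense layers shown here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, units=64):
    """y = x + F(x): the block only needs to learn the residual F(x)."""
    shortcut = x
    h = layers.Dense(units, activation="relu")(x)
    h = layers.Dense(units)(h)             # F(x)
    out = layers.Add()([shortcut, h])      # add the identity shortcut
    return layers.Activation("relu")(out)

inputs = tf.keras.Input(shape=(64,))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
```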


V. EXPERIMENTS

Our main objective for target-driven navigation is to find the shortest trajectories from the current location to the target. In this section, we first evaluate our model with baseline navigation models that are based on heuristics and standard deep RL models. One major advantage of our proposed model is the ability to generalize to new scenes and new targets. We conduct two additional experiments to evaluate the ability of our model to transfer knowledge across targets and across scenes. Also, we show an extension of our model to continuous space. Lastly, we demonstrate the performance of our model in a complex real setting using a real robot.


A. Navigation Results

We implement our models in Tensorflow [44] and train them on an Nvidia GeForce GTX Titan X GPU. We follow the training protocol described in Sec. IV-E to train our deep siamese actor-critic model (see Fig. 4) with 100 threads, each learning for a different target. It takes around 1.25 hours to pass through one million training frames across all threads. We report the performance as the average number of steps (i.e., average trajectory length) it takes to reach a target from a random starting point. The navigation performance is reported on 100 different goals randomly sampled from 20 indoor scenes in our dataset. We compare our final model with heuristic strategies, standard deep RL models, and variations of our model. The models we compare are:


1) Random walk is the simplest heuristic for navigation. In this baseline model, the agent randomly draws one out of four actions at each step.


2) Shortest Path provides an upper-bound performance for our navigation model. As we discretize the walking space by a constant step length (see Sec. IV-C), we can compute the shortest paths from the starting locations to the target locations. Note that for computing the shortest path, we have access to the full map of the environment, while the input to our system is just an RGB image.


3) A3C [3] is an asynchronous advantage actor-critic model that achieves the state-of-the-art results in Atari games. Empirical results show that using more threads improves the data efficiency during training. We thus evaluate A3C model in two setups, where we use 1 thread and 4 threads to train for each target.


4) One-step Q [3] is an asynchronous variant of the deep Q-network [2].


5) Target-driven single branch is a variation of our deep siamese model that does not have scene-specific branches. In this case, all targets will use and update the same scene-specific parameters, including two FC layers and the policy/value output layers.


6) Target-driven final is our deep siamese actor-critic model introduced in Sec. IV-D.


For all learning models, we report their performance after being trained with 100M frames (across all threads). The performance is measured by the average trajectory length (i.e., number of steps taken) over all targets. An episode ends when either the agent reaches the target, or after it takes 10,000 steps. For each target, we randomly initialize the agent's starting locations, and evaluate 10 episodes. The results are listed in Table I.



We analyze the data efficiency of learning with the learning curves in Fig. 5. Q-learning suffers from slow convergence. A3C performs better than Q-learning; plus, increasing the number of actor-learning threads per target from 1 to 4 improves learning efficiency. Our proposed target-driven navigation model significantly outperforms standard deep RL models when it uses 100M frames for training. We hypothesize that this is because both the weight sharing scheme across targets and the asynchronous training protocol facilitate learning generalizable knowledge. In contrast, purpose-built RL models are less data-efficient, as there is no straightforward mechanism to share information across different scenes or targets. The average trajectory length of the final model is three times shorter than the one of the single branch model. It justifies the use of scene-specific layers, as it captures particular characteristics of a scene that may vary across scene instances.



To understand what the model learns, we examine the embeddings learned by generic siamese layers. Fig. 6 shows t-SNE visualization [49] of embedding vectors computed from observations at different locations at four different orientations. We observe notable spatial correspondence between the spatial arrangement of these embedding vectors and their corresponding t-SNE projections. We therefore hypothesize that the model learns to project observation images into the embedding space while preserving their spatial configuration. To validate this hypothesis, we compare the distance of pairwise projected embeddings and the distance of their corresponding scene coordinates. The Pearson correlation coefficient is 0.62 with p-value less than 0.001, indicating that the embedding space preserves information of the original locations of observations. This means that the model learns a rough map of the environment and has the capability of localization with respect to this map.

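The correlation check described above can be reproduced along the following lines, assuming one embedding per observation plus the 2-D scene coordinate of the location where it was taken. The arrays below are random placeholders; the reported r = 0.62 comes from the paper's learned embeddings, not from this sketch.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

# Hypothetical data: one 512-d embedding per observation and the 2-D scene
# coordinate of the location where that observation was taken.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 512))
coordinates = rng.uniform(0.0, 10.0, size=(200, 2))

# Pairwise distances in embedding space vs. pairwise distances in the scene.
embed_dist = pdist(embeddings)     # condensed vector of N*(N-1)/2 distances
coord_dist = pdist(coordinates)

r, p_value = pearsonr(embed_dist, coord_dist)
print(f"Pearson r = {r:.2f}, p = {p_value:.3g}")
```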

 


Pearson correlation coefficient

The Pearson correlation coefficient measures whether two data sets lie along a line, i.e., the linear relationship between two interval-scaled variables.

The larger the absolute value of the correlation coefficient, the stronger the correlation: values close to 1 or -1 indicate strong correlation, and values close to 0 indicate weak correlation. [6]

 

t-SNE

t-SNE (t-distributed stochastic neighbor embedding) is a non-linear dimensionality-reduction algorithm for exploring high-dimensional data. It maps multi-dimensional data to two or more dimensions suitable for human inspection. With t-SNE, the next time you work with high-dimensional data you may not need to draw as many exploratory plots.

t-SNE versus other dimensionality-reduction algorithms

Here are several dimensionality-reduction algorithms you may come across:

1. Principal component analysis (linear)
2. t-SNE (non-parametric / non-linear)
3. Sammon mapping (non-linear)
4. Isomap (non-linear)
5. Locally linear embedding (non-linear)
6. Canonical correlation analysis (non-linear)
7. SNE (non-linear)
8. MVU (non-linear)
9. Laplacian eigenmaps (non-linear)

You only need to learn two of the algorithms above to visualize data effectively in lower dimensions: PCA and t-SNE.

 

Limitations of PCA

PCA is a linear algorithm. It cannot capture complex polynomial relationships between features. t-SNE, on the other hand, is based on probability distributions over random walks on neighborhood graphs and can find structural relationships within the data.

A major problem with linear dimensionality-reduction algorithms is that they concentrate on placing dissimilar data points far apart in the low-dimensional representation. But to represent high-dimensional data on a low-dimensional, non-linear manifold, similar data points must also be placed close together, which is not something linear methods can do.

Local methods seek to map nearby points on the manifold to nearby points in the low-dimensional representation. Global methods, by contrast, try to preserve geometry at all scales, mapping nearby points to nearby points and distant points to distant points.

Note that most non-linear techniques other than t-SNE cannot preserve both the local and the global structure of the data at the same time.

 

How t-SNE works

t-SNE finds regularities in the data by identifying patterns based on the similarity of data points across multiple features. It is a dimensionality-reduction algorithm, not a clustering algorithm, because when it maps high-dimensional data into a lower-dimensional space the original feature values no longer exist, so you cannot draw inferences based only on the t-SNE output. It is therefore mainly a data exploration and visualization technique.

That said, t-SNE can be used in classification and clustering pipelines, to generate input features for other algorithms. [5]
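For reference, a short scikit-learn example of running t-SNE for visualization; the data here is random and purely illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

# Purely illustrative data: 500 points in a 512-dimensional space.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 512))

# Map to 2-D for visualization; perplexity is the main knob to tune.
X_2d = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=0).fit_transform(X)
print(X_2d.shape)  # (500, 2)

# X_2d can be scatter-plotted (e.g., with matplotlib) to inspect structure;
# as noted above, the output is for exploration, not for downstream inference.
```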


 

B. Generalization Across Targets

In addition to the data-efficient learning of our target-driven models, our model has the built-in ability to generalize, which is a significant advantage over the purpose-built baseline models. We evaluate its generalization ability in two dimensions: 1. generalizing to new targets within one scene, and 2. generalizing to new scenes. We focus on generalization across targets in this section, and explain scene generalization in Sec. V-C.


We test the model on navigating to new targets that are excluded from training. We take 10 of the largest scenes in our dataset, each having around 15 targets. We gradually increase the number of trained targets (from 1, 2, 4 to 8) using our target-driven model. All models are trained with 20M frames. During testing, we run 100 episodes for each of 10 new targets. These new targets are randomly chosen from a set of locations that have a constant distance (1, 2, 4 and 8 steps) from the nearest trained targets. The results are illustrated in Fig. 7. We use success rate (the percentage of trajectories shorter than 500 steps) to measure the performance. We choose this metric due to the bipolar behavior of our model on new targets: it either reaches the new targets quickly, or fails completely. Thus, this metric is more effective than average trajectory lengths. In Fig. 7, we observe a consistent trend of increasing success rate as we increase the number of trained targets (x-axis). Inside each histogram group, the success rate positively correlates with the adjacency between trained and new targets. It indicates that the model has a clearer understanding of nearby regions around the trained targets than of distant locations.

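The success-rate metric used here is straightforward to compute; a small sketch with hypothetical trajectory lengths (10,000 being the episode cap mentioned earlier):

```python
def success_rate(trajectory_lengths, max_steps=500):
    """Fraction of evaluation episodes with a trajectory shorter than max_steps."""
    successes = sum(1 for n in trajectory_lengths if n < max_steps)
    return successes / len(trajectory_lengths)

# Example: 100 evaluation episodes for one new target (hypothetical numbers).
lengths = [12, 48, 10000, 33, 10000] + [25] * 95
print(f"success rate: {success_rate(lengths):.2%}")
```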

 

C. Generalization Across Scenes

We further evaluate our model's ability to generalize across scenes. As the generic siamese layers are shared over all scenes, we examine the possibility of transferring knowledge from these layers to new scenes. Furthermore, we study how the number of trained scenes would influence the transferability of generic layer parameters. We gradually increase the number of trained scenes from 1 to 16, and test on 4 unseen scenes. We select 5 random targets from each scene for training and testing. To adapt to unseen scenes, we train the scene-specific layers while fixing generic siamese layers. Fig. 8 shows the results. We observe faster convergence as the number of trained scenes grows.


Compared to training from scratch, transferring generic layers significantly improves data efficiency for learning in new environments. We also evaluate the single branch model in the same setup. As the single branch model includes a single scene-specific layer, we can apply a trained model (trained on 16 scenes) to new scenes without extra training. However, it results in worse performance than chance, indicating the importance of adapting scene-specific layers. The single branch model leads to slightly faster convergence than training from scratch, yet far slower than our final model.

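In Keras terms, adapting to an unseen scene amounts to freezing the transferred generic layers and training only a fresh scene-specific branch. A sketch, reusing the hypothetical layer names from the architecture sketch above (treating "generic_siamese" and "fusion" as the transferred generic layers is an assumption):

```python
import tensorflow as tf

def adapt_to_new_scene(model: tf.keras.Model):
    generic = {"generic_siamese", "fusion"}
    for layer in model.layers:
        # Transferred generic layers are frozen; the scene-specific branch
        # (scene_fc1/2, policy, value) stays trainable and is trained from
        # scratch for the unseen scene.
        layer.trainable = layer.name not in generic
    return model
```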

 

D. Continuous Space

The space discretization eliminates the need for handling complex system dynamics, such as noise in motor control. In this section, we show empirical results that the same learning model is capable of coping with more challenging continuous space.


To illustrate this, we train the same target-driven model for a door-finding task in a large living room scene, where the goal is to arrive at the balcony through a door. We use the same 4 actions as before (see Sec. IV-C); however, the agent's moves and turns are controlled by the physics engine. In this case, the method has to explicitly handle forces and collisions, as the agent may be stopped by obstacles or slide along heavy objects. Although this setting requires significantly more training frames (around 50M) to train for a single target, the same model learns to reach the door in 15 steps on average, whereas random agents take 719 steps on average. We provide sample test episodes in the supplementary video.


 

E. Robot Experiment

To validate the generalization of our method to real world settings, we perform an experiment by using a SCITOS mobile robot modified by [50] (see Fig. 9). We train our model in three different settings: 1) training on real images from scratch; 2) training only scene-specific layers while freezing generic layer parameters trained on 20 simulated scenes; and 3) training scene-specific layers and fine-tuning generic layer parameters.


We train our model (with the backward action disabled) on 28 discrete locations in the scene, which are roughly 30 inches apart from each other in each dimension. At each location, the robot takes 4 RGB images (90 degrees apart) using its head camera. During testing, the robot moves and turns based on the model's predictions. We evaluate the robot with two targets in the room: door and microwave. Although the model is trained on a discretized space, it exhibits robustness towards random starting points, noisy dynamics, varying step lengths, changes in illumination and object layouts, etc. Example test episodes are provided in the supplementary video. All three setups converge to a nearly optimal policy due to the small scale of the real scene. However, we find that transferring and fine-tuning parameters from simulation to real data offers the fastest convergence out of these three setups (44% faster than training from scratch). This provides supportive evidence for the value of simulations in learning real-world interactions and shows the possibility of generalizing from simulation to real images using a small amount of fine-tuning.


 

VI. CONCLUSIONS

We proposed a deep reinforcement learning (DRL) framework for target-driven visual navigation. The state-of-the-art DRL methods are typically applied to video games and environments that do not mimic the distribution of natural images. This work is a step towards more realistic settings.
The state-of-the-art DRL methods have some limitations that prevent them from being applied to realistic settings. In this paper, we addressed some of these limitations. We addressed generalization across scenes and targets, improved data efficiency compared to the state-of-the-art DRL methods, and provided the AI2-THOR framework, which enables inexpensive and efficient collection of action and interaction data.
Our experiments showed that our method generalizes to new targets and scenes that are not used during the end-to-end training of the model. We also showed our method converges with much fewer training samples compared to the state-of-the-art DRL methods. Furthermore, we showed that the method works in both discrete and continuous domains.


We also showed that a model that is trained on simulation can be adapted to a real robot with a small amount of fine-tuning. We provided visualizations that show that our DRL method implicitly performs localization and mapping. Finally, our method is end-to-end trainable. Unlike the common visual navigation methods, it does not require explicit feature matching or 3D reconstruction of the environment.


Our future work includes increasing the number of highquality 3D scenes in our framework. Additionally, we plan to build models that learn the physical interactions and object manipulations in the framework.



References:

[1] https://www.jianshu.com/p/92d7f6eaacf5

[2] https://baike.baidu.com/item/ImageNet

[3] https://blog.csdn.net/donkey_1993/article/details/84563530

[4] https://blog.csdn.net/u013181595/article/details/80990930

[5] https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/

[6] https://baike.baidu.com/item/Pearson%E7%9B%B8%E5%85%B3%E7%B3%BB%E6%95%B0/6243913?fr=aladdin