Unity ML-Agents Toolkit v0.8: Faster training on real games

Today, we are releasing a new update to the Unity ML-Agents Toolkit that enables faster training by launching multiple Unity simulations running on a single machine. This upgrade will enable game developers to create character behaviors by significantly speeding up training of Deep Reinforcement Learning algorithms.


In this blog post, we overview our work with our partner JamCity to train agents to play advanced levels of their Bubble Shooter, Snoopy Pop. Release v0.8 of the Unity ML-Agents Toolkit enabled them to train an agent to play a level on a single machine 7.5 times faster than was previously possible. Our work doesn’t stop here; we are also working on techniques to train multiple levels concurrently by scaling out training across multiple machines.


One of our core guiding principles since first releasing the Unity ML-Agents Toolkit has been to enable game developers to leverage Deep Reinforcement Learning (DRL) to develop behaviors for both playable and non-playable characters. We previously showed how DRL can be used to learn a policy for controlling Puppo using physics-based animations. However, real games are complex and DRL algorithms are computationally intensive and require a large volume of gameplay data in order to learn. Most DRL research leverages very lightweight games that can be sped up greatly (to generate gameplay data faster), whereas real games typically have constraints which require them to run at normal speed (or limit the amount of speed-up). This led us to focus on improving training on the most accessible computation platform available to a developer, their local development machine.


Creating emergent behaviors using DRL involves learning the weights of a neural network that represent a policy, a mapping from the agent’s observation to an action. Learning is accomplished by executing the policy on one or more simulation instances and using the output to update the weights in a manner that maximizes the agent’s reward. Training completes faster when we have more instances on which the policy is evaluated. Today, we are introducing the ability to train faster by having multiple concurrent instances of Unity on a multi-core machine. To illustrate the importance of utilizing multi-core machines in order to train agents in real games, we’ve partnered with JamCity and the Snoopy Pop game team. The changes we provide in v0.8 enable a training speedup of 5.5x on easy levels and up to 7.5x on harder levels by leveraging 16 Unity simulations. Generally speaking, the gains of utilizing multiple Unity simulations are greater for more complex levels and games.


The improvements in this update of the Unity ML-Agents Toolkit will enable you both to fully utilize the resources of your development machine and to greatly speed up training by leveraging a multi-core machine on a cloud provider such as Google Cloud Platform. We’ve additionally been experimenting and building internal infrastructure to scale out training across multiple machines to enable learning a single policy that can solve many levels of Snoopy Pop at the same time. The video below demonstrates a single, trained agent playing through increasingly difficult levels of Snoopy Pop.


A single trained agent playing multiple levels of Snoopy Pop


ML-Agents Toolkit + Snoopy Pop

Snoopy Pop is a bubble shooter created by JamCity. In Snoopy Pop, the player needs to free the character Woodstock and his flock of birds by popping bubbles. The player can shoot a bubble at a particular angle or switch the color of the bubble before shooting. When the bubble sticks onto the same type of bubble and forms a group of more than three, the group will vanish, the bird in the bubble will be freed, and the player will improve their score. The player completes the level when all of the birds on the board are freed. Conversely, a player loses when they deplete all of the bubbles in their bag. Our goal is to train an agent that can play the game as the player would, and reach the highest level possible.



Snoopy must clear the bubbles containing Woodstock and his flock


Using the ML-Agents Toolkit, an agent carries out a policy by receiving observations representing the game state and taking actions based on them. To solve Snoopy Pop using DRL, we first need to define these observations and actions, as well as the reward function which the policy attempts to maximize. As observations, the agent receives a simplified, low-resolution (84×84) version of the game board and the bubbles it is holding. The agent can then choose to shoot the bubble along 21 different angles or swap the bubble before shooting. After the bubble is shot and collides with (or does not collide with) other bubbles, the agent is rewarded for increasing the score, freeing birds, and winning. Negative rewards are also given for each bubble shot (to encourage the agent to solve the level quickly) and for losing.



Observations, Actions, and Rewards defined for Snoopy Pop

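To make this setup concrete, here is a minimal sketch of stepping such an environment through the toolkit’s Python API (mlagents.envs, as of v0.8). The executable name, the single-brain assumption, and the 22-way encoding of the discrete action (21 shot angles plus a color swap) are illustrative placeholders rather than the Snoopy Pop team’s actual integration.

```python
import numpy as np
from mlagents.envs import UnityEnvironment

# Illustrative only: "SnoopyPop" stands in for a built player with one learning brain.
env = UnityEnvironment(file_name="SnoopyPop", worker_id=0)
brain_name = env.brain_names[0]

info = env.reset(train_mode=True)[brain_name]
# Visual observation: one low-resolution image per agent, e.g. shape (1, 84, 84, 3).
frame = info.visual_observations[0]
print(frame.shape)

# Assumed encoding: actions 0-20 shoot along one of 21 angles, action 21 swaps the bubble.
action = np.random.randint(0, 22)
info = env.step(vector_action=[action])[brain_name]
print(info.rewards[0], info.local_done[0])  # shaped reward and episode-done flag

env.close()
```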

The use of visual observations with a large action space makes Snoopy Pop levels difficult to solve. For a simple level, the agent needs to take more than 80,000 actions to learn an effective policy. More difficult levels may take half a million actions or more.


Additionally, the game uses physics to simulate how the bubbles bounce and collide with other bubbles, making it difficult to change the timescale without substantially changing the dynamics of the game. Even at a 5x timescale, we can only collect about two actions per second. At that rate, the 80,000 actions needed for a simple level take over 11 hours to gather, and more difficult levels take several days. This makes it critical to scale out the data collection process by launching multiple, concurrent Unity simulations, to make the best use of the machine’s resources.

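As a quick back-of-the-envelope check on those figures, using the roughly two actions per second measured for a single instance and ignoring trainer overhead:

```python
ACTIONS_PER_SECOND = 2        # throughput of one Snoopy Pop instance at 5x timescale

def collection_hours(num_actions, num_envs=1):
    """Wall-clock hours needed to gather num_actions of gameplay data."""
    return num_actions / (ACTIONS_PER_SECOND * num_envs) / 3600.0

print(collection_hours(80_000))           # ~11.1 hours for a simple level
print(collection_hours(500_000) / 24.0)   # ~2.9 days for a harder level

# With 16 concurrent simulations the ideal figure drops 16x; in practice the
# end-to-end speedup is sub-linear (5.5x to 7.5x in the results below).
print(collection_hours(80_000, num_envs=16))
```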

Running multiple, concurrent instances of Snoopy Pop

While we are limited in how much we can speed up a single instance of Snoopy Pop, multi-core processors allow us to run multiple instances on a single machine. Since each play-through of the game is independent, we can trivially parallelize the collection of our training data.


Each simulation feeds data into a common training buffer, which is then used by the trainer to update its policy in order to play the game better. This new paradigm allows us to collect much more data without having to change the timescale or any other game parameters which may have a negative effect on the gameplay mechanics. We believe this is the first necessary step in order to bring higher performance training to users of the ML-Agents Toolkit.

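Conceptually, the setup looks like the sketch below: each worker process owns one game instance and pushes its experience into a single shared buffer that the trainer drains. This is an illustration of the idea rather than the toolkit’s internal implementation; make_env is a hypothetical factory that launches one simulation (for example by wrapping UnityEnvironment with a distinct worker_id), and the reset/step/sample_action interface of the returned object is assumed.

```python
import multiprocessing as mp

def collect(worker_id, make_env, buffer_queue, steps_per_worker):
    # make_env is a hypothetical factory that launches one simulation instance,
    # e.g. by wrapping UnityEnvironment(file_name=..., worker_id=worker_id).
    env = make_env(worker_id)
    obs = env.reset()
    for _ in range(steps_per_worker):
        action = env.sample_action()               # stand-in for querying the current policy
        next_obs, reward, done = env.step(action)
        buffer_queue.put((obs, action, reward, done))
        obs = env.reset() if done else next_obs
    env.close()

def gather_experience(make_env, num_envs=16, steps_per_worker=1_000):
    """Run num_envs game instances in parallel and drain them into one common buffer."""
    queue = mp.Queue()
    workers = [mp.Process(target=collect, args=(i, make_env, queue, steps_per_worker))
               for i in range(num_envs)]
    for w in workers:
        w.start()
    # The trainer reads from this common buffer to update the policy.
    experience = [queue.get() for _ in range(num_envs * steps_per_worker)]
    for w in workers:
        w.join()
    return experience
```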

Performance results

To demonstrate the utility of launching multiple, concurrent Unity simulations, we’re sharing training times for two different levels of Snoopy Pop (Levels 2 and 25). More specifically, we recorded the training time across a varying number of Unity simulations. Since each additional concurrent environment has a small coordination overhead, we expect diminishing returns as we scale further. Additionally, for simple levels or games, adding more Unity simulations may not improve performance as the gameplay data generated from those additional simulations will be highly correlated with existing gameplay data and thus won’t provide a benefit to the training algorithm. To summarize, expect diminishing returns as you add more Unity simulations, where the diminishing rate depends on the difficulty of the level or game on which the model is being trained.


The first graph below shows the training time for the v0.8 release to solve level 2 of Snoopy Pop across a range of 1 to 16 parallel environments. We took the average time across 3 runs, since randomness in the training process can significantly change the time from run to run. You’ll notice a very large performance boost when scaling from one to two environments, followed by steady but sub-linear scaling after that, with a 5.5x improvement when using 16 environments versus 1 environment.


We also find that the effects of training with parallel environments become more pronounced on harder levels of Snoopy Pop. This is because, with more difficult levels, the experiences generated across the multiple simulations are more independent (and thus more beneficial to the training process) than for simpler levels. Here is a graph comparing the performance of our v0.8 release on level 25 of Snoopy Pop. Note that there is an almost 7.5x improvement in using 16 environments compared to 1 environment.


Today’s release of ML-Agents Toolkit v0.8 supports training with multiple, concurrent Unity simulations on a single machine. If you have an existing environment, you’ll just need to update to the latest version of the ML-Agents Toolkit and re-build your game. After upgrading, you’ll have access to a new option for the mlagents-learn tool which allows you to specify the number of parallel environments you’d like to run. See our documentation for more information.

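For example, assuming a built player named SnoopyPop and the trainer configuration file shipped with the toolkit (both names below are placeholders for your own setup), a training run with eight concurrent simulations would look something like this:

```
mlagents-learn config/trainer_config.yaml --env=SnoopyPop --num-envs=8 --run-id=snoopy-parallel --train
```

Each increment of --num-envs launches another Unity process, so the practical upper bound is set by the cores and memory of your machine.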

Other updates

In addition to the ability to launch multiple Unity simulations, this update of the ML-Agents Toolkit comes with a few bonus features.


Custom protocol buffer messages

Many researchers need the ability to exchange structured data between Python and Unity outside of what is included by default. In this release, we’ve created an API which allows any developer to create custom protocol buffer messages and use them as observations, actions, or reset parameters.


Render texture observations

In addition to Visual Observations with Cameras, we’ve also included the ability to use RenderTexture. This will enable users to render textures for Visual Observations in ways other than using a camera, such as 2D Sprites, webcam, or other custom implementations.


2D ray casting

Many users have asked about using ray casting in their 2D games. In this release, we have refactored RayPerception and added support for 2D ray casting (RayPerception2D).


Multiple Python packages

We have split the mlagents Python package into two separate packages (mlagents.trainers and mlagents.envs). This will allow users to decouple version dependencies, like TensorFlow, and make it easier for researchers to use Unity environments without having to disrupt their pre-existing Python configurations.

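As a quick illustration of the decoupling (assuming the v0.8 package layout, where the environment API lives in mlagents.envs and can be installed on its own as the mlagents-envs package), the environment side can now be used without importing the trainers or TensorFlow:

```python
# pip install mlagents-envs    # environment API only, no trainer/TensorFlow dependency
from mlagents.envs import UnityEnvironment

# file_name=None connects to the Unity Editor instead of a built player.
env = UnityEnvironment(file_name=None)
print(env.brain_names)
env.close()
```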

Thanks to our contributors

The Unity ML-Agents Toolkit is an open-source project that has greatly benefited from community contributions. Today, we want to thank the external contributors who have made enhancements that were merged into this release: @pyjamads for render texture, @Tkggwatson for the optimization improvements, @malmaud for the custom protocol buffer feature, and @LeSphax for the video recorder, @Supercurious / @rafvasq / @markovuksanovic / @borisneal / @dmalpica for various improvements.


Next steps

This release of the Unity ML-Agents Toolkit enables you to train agents faster on a single machine. We intend to continue investing in this area and release future updates that will enable you to better maximize the resource usage on a single machine.


If you’d like to work on this exciting intersection of Machine Learning and Games, we are hiring for several positions, please apply!


If you use any of the features provided in this release, we’d love to hear from you. For any feedback regarding the Unity ML-Agents Toolkit, please fill out the following survey and feel free to email us directly. If you encounter any issues or have questions, please reach out to us on the ML-Agents GitHub issues page.


Translated from: https://blogs.unity3d.com/2019/04/15/unity-ml-agents-toolkit-v0-8-faster-training-on-real-games/