Training your agents 7 times faster with ML-Agents

In v0.9 and v0.10 of ML-Agents, we introduced a series of features aimed at decreasing training time, namely Asynchronous Environments, Generative Adversarial Imitation Learning (GAIL), and Soft Actor-Critic. With our partner JamCity, we previously showed that the parallel Unity instance feature introduced in v0.8 of ML-Agents enabled us to train agents for their bubble shooter game, Snoopy Pop, 7.5x faster than with a single instance. In this blog post, we will explain how v0.9 and v0.10 build on those results and show that we can decrease Snoopy Pop training time by an additional 7x, enabling more performant agents to be trained in a reasonable time.

The purpose of the Unity ML-Agents Toolkit is to enable game developers to create complex and interesting behaviors for both playable and non-playable characters using Deep Reinforcement Learning (DRL). DRL is a powerful and general tool that can be used to learn a variety of behaviors, from physics-based characters to puzzle game solvers. However, DRL requires a large volume of gameplay data to learn effective behaviors, a problem for real games, which are typically constrained in how much they can be sped up.

Several months ago, with the release of ML-Agents v0.8, we introduced the ability for ML-Agents to run multiple Unity instances of a game on a single machine, dramatically increasing the throughput of training samples (i.e., the agent’s observations, actions, and rewards) that we can collect during training. We partnered with JamCity to train an agent to play levels of their Snoopy Pop puzzle game. Using the parallel environment feature of v0.8, we were able to achieve up to 7.5x training speed up on harder levels of Snoopy Pop.

But parallel environments will only go so far—there is a limit to how many concurrent Unity instances can be run on a single machine. To improve training time on resource-constrained machines, we had to find another way. In general, there are two ways to improve training time: increase the number of samples gathered per second (sample throughput), or reduce the number of samples required to learn good behavior (sample efficiency). Consequently, in v0.9, we improved our parallel trainer to gather samples asynchronously, thereby increasing sample throughput.

Furthermore, we added Generative Adversarial Imitation Learning (GAIL), which enables the use of human demonstrations to guide the learning process, thus improving sample efficiency. Finally, in v0.10, we introduced Soft Actor-Critic (SAC), a trainer that has substantially higher sample efficiency than the Proximal Policy Optimization trainer in v0.8. Together, these changes improved training time by another 7x on a single machine. For Snoopy Pop, this meant that we were not only able to create agents that solve levels, but agents that solve them in the same number of steps as a human player. With the increased sample throughput and efficiency, we were able to train multiple levels of Snoopy Pop on a single machine, something that previously required multiple days of training on a cluster of machines. This blog post will detail the improvements made in each subsequent version of ML-Agents, and how they affected the results in Snoopy Pop.

ML-Agents Toolkit + Snoopy Pop

We first introduced our integration of ML-Agents with Snoopy Pop in our ML-Agents v0.8 blog post. The figure below summarizes what the agent can see, what it can do, and the rewards it receives. Note that compared to our previous experiments with Snoopy Pop, we decreased the magnitude of the positive reward and increased the penalty for using a bubble, forcing the agent to focus less on simply finishing the level and more on clearing bubbles in the fewest steps possible, just as a human player would. This is a much harder problem than just barely winning the level, and it takes significantly longer to learn a good policy.

Figure: Observations, Actions, and Rewards defined for Snoopy Pop

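To make that reward structure concrete, here is a minimal sketch of a shaped reward of this kind. The constant values are placeholders chosen for illustration only, not the actual values used for Snoopy Pop.

```python
# Illustrative only: placeholder reward shaping for a bubble-shooter level.
# The constants below are assumptions, not the values used for Snoopy Pop.

BUBBLE_CLEARED_REWARD = 0.01   # small positive reward per bubble cleared
SHOT_PENALTY = -0.05           # penalty for spending a bubble (encourages short solutions)
LEVEL_CLEARED_REWARD = 1.0     # terminal reward for finishing the level


def step_reward(bubbles_cleared: int, level_cleared: bool) -> float:
    """Reward for a single shot: pay a fixed cost for the shot, gain a small
    bonus per bubble cleared, and a large bonus if the level is finished."""
    reward = SHOT_PENALTY + BUBBLE_CLEARED_REWARD * bubbles_cleared
    if level_cleared:
        reward += LEVEL_CLEARED_REWARD
    return reward
```

Weighting the per-shot penalty against the clearing bonus is what pushes the agent toward finishing levels in as few shots as a human player would.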

ML-Agents v0.8: Running multiple, concurrent instances of Snoopy Pop

In ML-Agents v0.8, we introduced the ability to train multiple Unity instances at the same time. While we are limited in how much we can speed up a single instance of Snoopy Pop, multi-core processors allow us to run multiple instances on a single machine. Since each play-through of the game is independent, we can trivially parallelize the collection of our training data.

Each simulation environment feeds data into a common training buffer, which is then used by the trainer to update its policy in order to play the game better. This new paradigm allows us to collect much more data without having to change the timescale or any other game parameters which may have a negative effect on the gameplay mechanics.

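As a rough sketch of this data flow (not the toolkit's actual implementation), the loop below steps a handful of toy environments in lockstep and pushes every transition into one shared buffer. `ToyEnv` and `collect_synchronously` are stand-ins invented purely for illustration.

```python
import random

class ToyEnv:
    """Stand-in for one game instance; not the real ML-Agents API."""
    def reset(self):
        self.t = 0
        return [0.0]                              # observation
    def step(self, action):
        self.t += 1
        obs, reward, done = [float(self.t)], random.random(), self.t >= 10
        return obs, reward, done

def collect_synchronously(envs, policy, buffer, iterations):
    obs = [env.reset() for env in envs]
    for _ in range(iterations):
        actions = [policy(o) for o in obs]        # one decision per environment
        for i, (env, a) in enumerate(zip(envs, actions)):
            next_obs, reward, done = env.step(a)  # every env steps before the trainer moves on
            buffer.append((obs[i], a, reward, next_obs, done))
            obs[i] = env.reset() if done else next_obs

buffer = []                                       # common training buffer shared by all instances
collect_synchronously([ToyEnv() for _ in range(4)], policy=lambda o: 0, buffer=buffer, iterations=20)
print(len(buffer))                                # 4 environments x 20 lockstep iterations = 80
```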

ML-Agents v0.9: Asynchronous Environments and Imitation Learning

In ML-Agents v0.9, we introduced two improvements to sample efficiency and sample throughput, respectively.

Asynchronous Environments

In the v0.8 implementation of parallel environments, each Unity instance takes a step in sync with the others, and the trainer receives all observations and sends all actions at the same time. For some environments, such as those provided with the ML-Agents toolkit, the agents take decisions at roughly the same constant frequency, and executing them in lock-step is not a problem. However, for real games, certain actions may take longer than others. For instance, in Snoopy Pop, clearing a large number of bubbles incurs a longer animation than clearing none, and winning the game and resetting the level takes longer than taking a shot. This means that if even one of the parallel environments takes one of these longer actions, the others must wait.

In ML-Agents v0.9, we enabled asynchronous parallel environments. As long as at least one of the environments has finished taking its action, the trainer can send a new action and take the next step. For environments with varying step times, this can significantly improve sample throughput.

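One simple way to picture the asynchronous version, again as a sketch rather than the toolkit's code, is to run each environment step on a worker thread and let the trainer act on whichever environment returns first instead of waiting for the slowest one.

```python
import concurrent.futures as cf
import random
import time

class SlowToyEnv:
    """Stand-in environment whose steps take a variable amount of time,
    e.g. a long bubble-clearing animation versus a quick shot."""
    def reset(self):
        return [0.0]
    def step(self, action):
        time.sleep(random.uniform(0.01, 0.05))        # variable-length step
        obs, reward, done = [random.random()], random.random(), random.random() < 0.1
        return obs, reward, done

def collect_asynchronously(envs, policy, buffer, total_steps):
    obs = {i: env.reset() for i, env in enumerate(envs)}
    acts = {i: policy(obs[i]) for i in range(len(envs))}
    with cf.ThreadPoolExecutor(max_workers=len(envs)) as pool:
        pending = {pool.submit(env.step, acts[i]): i for i, env in enumerate(envs)}
        collected = 0
        while collected < total_steps:
            # Act as soon as *any* environment finishes, instead of waiting for the slowest one.
            ready, _ = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
            for fut in ready:
                i = pending.pop(fut)
                next_obs, reward, done = fut.result()
                buffer.append((obs[i], acts[i], reward, next_obs, done))
                obs[i] = envs[i].reset() if done else next_obs
                acts[i] = policy(obs[i])
                pending[pool.submit(envs[i].step, acts[i])] = i
                collected += 1

buffer = []
collect_asynchronously([SlowToyEnv() for _ in range(4)], lambda o: 0, buffer, total_steps=40)
print(len(buffer))                                    # at least 40 transitions collected
```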

Generative Adversarial Imitation Learning (GAIL)

In a typical DRL training process, the agent is initialized with a random behavior, performs random actions in the environment, and may happen upon some rewards. It then reinforces behaviors that produce higher rewards, and, over time, the behavior tends towards one that maximizes the reward in the environment and becomes less random.

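As a toy illustration of that "reinforce what pays off" loop (a gradient-bandit sketch with made-up rewards, not what ML-Agents actually runs), the snippet below starts from a uniform random policy over three actions and gradually shifts probability mass toward the action with the highest average reward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.1, 0.5, 0.9])      # hypothetical expected reward per action
prefs = np.zeros(3)                            # action preferences; uniform policy at the start
baseline, alpha = 0.0, 0.1

for t in range(1, 2001):
    probs = np.exp(prefs) / np.exp(prefs).sum()          # softmax policy
    a = rng.choice(3, p=probs)                           # behave (randomly at first)
    r = true_rewards[a] + rng.normal(0, 0.1)             # noisy reward from the environment
    # Reinforce: raise the preference of actions that beat the baseline, lower the others.
    one_hot = np.eye(3)[a]
    prefs += alpha * (r - baseline) * (one_hot - probs)
    baseline += (r - baseline) / t                       # running average reward

print(np.round(np.exp(prefs) / np.exp(prefs).sum(), 3))  # most probability mass lands on the best action
```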

However, not all optimal behavior is easy to find through random exploration. For example, the reward may be sparse, i.e., the agent must take many correct actions before receiving a reward. Or, the environment may have many local optima, i.e., places the agent could go that appear to lead it toward the maximum reward but are actually on an incorrect path. Both of these issues can, in principle, be solved by brute-force random search, but doing so requires many, many samples; they contribute to the millions of samples required to train Snoopy Pop. In some cases, the agent may never find the optimal behavior.

But what if we could do a bit better, guiding the agent towards good behavior by providing it with human demonstrations of the game? This area of research is called Imitation Learning, and it was added to ML-Agents in v0.3. One of the drawbacks of Imitation Learning in ML-Agents was that it could only be used independently of reinforcement learning, training an agent purely on demonstrations without any rewards from the environment.

In v0.9, we introduced GAIL, which addresses both of these issues, based on research by Jonathan Ho and his colleagues. You can read more about the algorithm in their paper.

To use Imitation Learning with ML-Agents, you first have a human player (or a bot) play through the game several times, saving the observations and actions to a demonstration file. During training, the agent is allowed to act in the environment as usual and gather observations of its own. At a high level, GAIL works by training a second learning algorithm (the discriminator, implemented with a neural network) to classify whether a particular observation (and action, if desired) came from the agent, or the demonstrations. Then, for each observation the agent gathers, it is rewarded by how close its observations and actions are to those in the demonstrations. The agent learns how to maximize this reward. The discriminator is updated with the agent’s new observations, and gets better at discriminating. In this iterative fashion, the discriminator gets tougher and tougher—but the agent gets better and better at “tricking” the discriminator and mimicking the demonstrations.

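A heavily simplified sketch of that loop is shown below, using a logistic-regression discriminator over toy observation vectors. The real discriminator in ML-Agents is a neural network and the exact reward formulation differs, so treat this purely as an illustration of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyDiscriminator:
    """Logistic regression that scores how 'demonstration-like' an observation is."""
    def __init__(self, obs_dim, lr=0.05):
        self.w = np.zeros(obs_dim)
        self.b = 0.0
        self.lr = lr

    def prob_demo(self, obs):
        return sigmoid(obs @ self.w + self.b)

    def update(self, demo_obs, agent_obs):
        # Label demonstrations 1 and agent samples 0, take one gradient step on BCE loss.
        x = np.vstack([demo_obs, agent_obs])
        y = np.concatenate([np.ones(len(demo_obs)), np.zeros(len(agent_obs))])
        p = self.prob_demo(x)
        grad_z = p - y                        # dLoss/dlogit for binary cross-entropy
        self.w -= self.lr * x.T @ grad_z / len(y)
        self.b -= self.lr * grad_z.mean()

    def gail_reward(self, obs):
        # Reward the agent for samples the discriminator mistakes for demonstrations.
        return -np.log(1.0 - self.prob_demo(obs) + 1e-8)

# Toy data: demonstrations cluster around +1, the untrained agent around -1.
demo_obs = rng.normal(loc=1.0, scale=0.5, size=(256, 4))
agent_obs = rng.normal(loc=-1.0, scale=0.5, size=(256, 4))

disc = TinyDiscriminator(obs_dim=4)
for _ in range(200):
    disc.update(demo_obs, agent_obs)

print(disc.gail_reward(demo_obs[:3]))   # high reward: looks like the demonstrations
print(disc.gail_reward(agent_obs[:3]))  # low reward: clearly agent-generated
```

In practice the discriminator is refit continually on fresh agent samples, which is what makes it "tougher and tougher" as training progresses.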

Because GAIL simply gives the agent a reward, leaving the learning process unchanged, we can combine GAIL with reward-based DRL by simply weighting and summing the GAIL reward with those given by the game itself. If we ensure the magnitude of the game’s reward is greater than that of the GAIL reward, the agent will be incentivized to follow the human player’s path through the game until it is able to find a large environment reward.

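Continuing the sketch above, combining the two signals is just a weighted sum. The 0.9/0.1 split below is an arbitrary illustration, not the configuration used for Snoopy Pop.

```python
# Hypothetical weights: the environment reward is deliberately weighted higher
# than the imitation reward so the agent can outgrow the demonstrations.
ENV_REWARD_STRENGTH = 0.9
GAIL_REWARD_STRENGTH = 0.1

def combined_reward(env_reward: float, gail_reward: float) -> float:
    return ENV_REWARD_STRENGTH * env_reward + GAIL_REWARD_STRENGTH * gail_reward
```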

Figure: Generative Adversarial Imitation Learning

ML-Agents v0.10: Soft Actor-Critic

Since its initial release, the ML-Agents Toolkit has used Proximal Policy Optimization (PPO), a stable, flexible DRL algorithm. In v0.10, in the interest of speeding up your training on real games, we released a second DRL algorithm, SAC, based on work by Tuomas Haarnoja and his colleagues. One of the critical features of SAC, which was originally created to learn on real robots, is sample efficiency. For games, this means we don't need to run the games as long to learn a good policy.

DRL algorithms fall into one of two categories: on-policy and off-policy. An on-policy algorithm such as PPO collects some number of samples, learns how to improve its policy based on them, then updates its policy accordingly. By collecting samples using its current policy, it learns how to improve itself, increasing the probability of taking rewarding actions and decreasing the probability of those that are not rewarding. Most modern on-policy algorithms, such as PPO, also learn a form of evaluation function, such as a value estimate (the expected discounted sum of rewards to the end of the episode, given that the agent is in a particular state) or a Q-function (the expected discounted sum of rewards if a given action is taken in a particular state). In an on-policy algorithm, these evaluators estimate the series of rewards assuming the current policy is followed. Without going into much detail, this estimate helps the algorithm train more stably.

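To make "expected discounted sum of rewards" concrete, here is a small, generic helper that turns one episode's rewards into the discounted returns a value estimate is trained to predict; it is a sketch, not taken from the toolkit.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a sparse reward arriving only at the end of a short episode.
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.9))
# approximately [0.729, 0.81, 0.9, 1.0] -- earlier states get a discounted share of the final reward
```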

Off-policy algorithms, such as SAC, work a bit differently. Assuming the environment has fixed dynamics and a fixed reward function, there exists some optimal relationship between taking a particular action in a given state and the cumulative reward that follows (i.e., what would the best possible policy be able to get?). If we knew this relationship, learning an effective policy would be really easy! Rather than learning how good the current policy is, off-policy algorithms learn this optimal evaluation function across all policies. This is a harder learning problem than in the on-policy case, since the real function could be very complex. But because you're learning a global function, you can use all the samples you've collected since the beginning of training to help learn your evaluator, making off-policy algorithms much more sample-efficient than on-policy ones. This re-use of old samples is called experience replay, and all samples are stored in a large experience replay buffer that can hold hundreds (if not thousands) of games' worth of data.

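An experience replay buffer can be sketched in a few lines: a bounded store that keeps the most recent transitions and hands back random minibatches for off-policy updates. This is a generic illustration, not the SAC trainer's actual buffer.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (obs, action, reward, next_obs, done) transitions."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)   # oldest transitions are evicted automatically

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size=256):
        # Off-policy learners reuse old experience by sampling it uniformly at random.
        # (Copying to a list keeps this sketch simple; a real buffer would index directly.)
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))

buffer = ReplayBuffer(capacity=1000)
for i in range(50):
    buffer.add(obs=[i], action=0, reward=0.0, next_obs=[i + 1], done=False)
print(len(buffer.sample(batch_size=8)))  # 8
```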

For our toolkit, we've adapted the original SAC algorithm, which was designed for continuous-action locomotion tasks, to support all of the features you're used to in ML-Agents: Recurrent Neural Networks (memory), branched discrete actions, curiosity, GAIL, and more.

Figure: Off-policy vs. On-Policy DRL Algorithms

Performance results in Snoopy Pop

In our previous experiments, we demonstrated that for a complex level of Snoopy Pop (Level 25), we saw a 7.5x decrease in training time going from a single environment (i.e., v0.7 of ML-Agents) to 16 parallel environments on a single machine. This meant that a single machine could be used to find a basic solution to Level 25 in under 9 hours. Using this capability, we trained our agents to go further and master Level 25, i.e., solve it at human-level performance. Note that this takes considerably longer than simply solving the level: an average of about 33 hours.

Here, we declare an agent to have "mastered" a level if it reaches average human performance (solves the level using at or under the number of bubbles a human uses) over 1000 steps. For Level 25, this corresponds to 25.14 steps (bubbles shot), averaged over 21 human playthroughs of the same level.

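For reference, the mastery check amounts to comparing the agent's recent average against that human baseline. The helper below is our own illustrative formulation of it, averaged per episode rather than over a fixed step window.

```python
HUMAN_AVG_BUBBLES_LEVEL_25 = 25.14   # average over 21 human playthroughs of Level 25

def has_mastered_level(bubbles_used_per_episode, human_avg=HUMAN_AVG_BUBBLES_LEVEL_25):
    """'Mastered' here means the agent's average shots per solved level is at or
    below the human average over the evaluation window."""
    avg = sum(bubbles_used_per_episode) / len(bubbles_used_per_episode)
    return avg <= human_avg

print(has_mastered_level([24, 26, 23, 25]))  # True: average 24.5 <= 25.14
```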

We then tested each improvement from v0.9 and v0.10 incrementally, measuring the time it takes to exceed human performance on the level. All in all, they add up to an additional 7x speedup in mastering the level! Each value shown is an average over three runs, as training times may vary between runs; sometimes the agent gets lucky and finds a good solution quickly. All runs were done on a 16-core machine with training accelerated by a K80 GPU, and 16 Unity instances were run in parallel during training.

For the GAIL experiments, we used the 21 human playthroughs of Snoopy Pop as demonstrations. Note that the bubble colors in Level 25 are randomly generated, so the 21 playthroughs in no way cover all possible board configurations of the level; if they did, the agent could learn very quickly simply by memorizing and copying the player's behavior. We then mixed a GAIL reward signal with the one provided by the Snoopy Pop game itself, so that GAIL can guide the agent's learning early in the process while still allowing it to find its own solution later.

| | Parallel Environments (v0.8) | Asynchronous Environments (v0.9) | GAIL with PPO (v0.9) | SAC (v0.10) | GAIL with SAC (v0.10) |
| --- | --- | --- | --- | --- | --- |
| Time to Reach Human Performance (hours) | 34:03 | 31:08 | 23:18 | 5:58 | 4:44 |
| Sample Throughput (samples/second) | 10.83 | 14.81 | 14.51 | 15.04 | 15.28 |

Let's visualize the speedup in graph form below. We see that the increase in sample throughput from using asynchronous environments results in a reduction in training time without any changes to the algorithm. The bigger reductions in training time, however, come from improving the sample efficiency of training; note that sample throughput did not change substantially between ML-Agents v0.9 and v0.10. Adding demonstrations and using GAIL to guide training meant that the agent used 26% fewer samples to reach the same training behavior, and we see a corresponding drop in training time. Switching to Soft Actor-Critic, an off-policy algorithm, meant that the agent solved the level with 81% fewer samples than vanilla PPO, and adding GAIL to SAC yields a further improvement.

These improvements aren’t unique to the new reward function and goal of reaching human performance. If we task SAC+GAIL with simply solving the level, as we had done in our previous experiments, we are able to do so in 1 hour, 11 minutes, vs. 8 hours, 24 minutes.

Next steps

If you'd like to work on this exciting intersection of Machine Learning and Games, we are hiring for several positions; please apply!

If you use any of the features provided in this release, we’d love to hear from you. For any feedback regarding the Unity ML-Agents Toolkit, please fill out the following survey and feel free to email us directly. If you encounter any issues or have questions, please reach out to us on the ML-Agents GitHub issues page.

Source: https://blogs.unity3d.com/2019/11/11/training-your-agents-7-times-faster-with-ml-agents/