Dynamic Path Planning of Unknown Environment Based on Deep Reinforcement Learning

The Pygame module in Python is used to build the dynamic environments.

In [11, 12], deep reinforcement learning has been applied to autonomous navigation based on visual inputs, with remarkable success. However, [11] only uses a static maze with varying start and target positions; there are no dynamic obstacles.

The DDQN is trained on laser information.
In recent deep reinforcement learning models, the original training mode fills the pool with a large number of samples that are just moving states in the free zone. The resulting lack of trial-and-error punishment samples and target reward samples ultimately causes the algorithm to fail to converge.
So, we constrain the starting position and target position by randomly setting the target position in an area that is not occupied by obstacles, which broadens the state space distribution of the sample pool.
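The random placement of the target in unoccupied space can be sketched as below. This is a minimal illustration, not the paper's code: the occupancy-grid representation (0 = free, 1 = obstacle) and the helper name `sample_free_cell` are assumptions made for the example.

```python
import random

def sample_free_cell(grid, rng=random):
    """Return a random (row, col) whose cell is free (0) in an occupancy grid.

    The grid layout (0 = free, 1 = obstacle) is an assumption for this sketch.
    """
    free = [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value == 0
    ]
    if not free:
        raise ValueError("grid has no free cells")
    return rng.choice(free)

# Hypothetical 3x3 map with a wall in the middle column.
grid = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
target = sample_free_cell(grid)
```

Drawing a fresh target like this for every episode is one way to spread the samples in the pool over more of the state space.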

For the content of the section "Deep Reinforcement Learning with DDQN Algorithm", see the paper.

Local Path Planning with DDQN Algorithm

The sensor range is regarded as the observation window.
The accessible point nearest to the global path at the edge of the observation window is taken as the local target point of the local path planning.
The network receives the lidar dot matrix information and the local target point coordinates as inputs and outputs the direction of movement.
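The selection of the local target point can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the window is treated as a square on an integer grid, and `local_target`, `half_size`, and `is_free` are hypothetical names.

```python
import math

def local_target(robot, half_size, global_path, is_free):
    """Pick the free cell on the window border that is closest to the global path.

    Assumes a square observation window of side 2 * half_size + 1 centred on
    the robot; returns None when no border cell is accessible.
    """
    rx, ry = robot
    border = [
        (x, y)
        for x in range(rx - half_size, rx + half_size + 1)
        for y in range(ry - half_size, ry + half_size + 1)
        if max(abs(x - rx), abs(y - ry)) == half_size
    ]
    candidates = [p for p in border if is_free(p)]
    if not candidates:
        return None

    def dist_to_path(p):
        # Distance from p to the nearest waypoint of the global path.
        return min(math.dist(p, q) for q in global_path)

    return min(candidates, key=dist_to_path)

# Hypothetical example: the global path crosses the border at (2, 2),
# but that cell is blocked, so a neighbouring border cell is chosen.
obstacles = {(2, 2)}
path = [(0, 0), (1, 1), (2, 2), (3, 3)]
goal = local_target((0, 0), 2, path, lambda p: p not in obstacles)
```

Measuring distance to waypoints rather than to path segments keeps the sketch short; a denser path or a point-to-segment distance would be more accurate.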
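So this is really just a local planner: it takes a (square) costmap as input and outputs the current cmd_msg.
A local planner does not process laser data directly; it operates on the costmap: https://blog.csdn.net/weixin_42048023/article/details/83987653
But the paper says the network receives the lidar information and the local target (the target point on the local map) and outputs the direction of movement. If the lidar data it receives is the lidar data within the costmap region, then the inputs and outputs are much the same as those of a conventional costmap-based planner.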

We set the angle resolution to 1 degree and the range limit to 2 meters, so each observation consists of 360 points indicating the distance to obstacles within a two-meter circle around the robot.

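The input is then the laser information (angle and distance) over 360° plus 40 copies of the target point, an 800-dimensional vector in total. The output is a one-hot vector over 8 directions: forward, backward, left, right, and the four diagonals.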
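One way to arrive at 800 dimensions is sketched below. The exact layout is an assumption of this sketch: 360 (angle, distance) pairs give 720 values, and 40 copies of a 2-D target point give the remaining 80. The ordering of the eight actions and the names `build_state` and `one_hot` are also hypothetical.

```python
import math

NUM_BEAMS = 360        # 1 degree angle resolution
MAX_RANGE_M = 2.0      # range limit of the lidar
TARGET_COPIES = 40     # copies of the local target point

# Eight moves as (dx, dy): four axis-aligned and four diagonal.
# The ordering is an assumption for this sketch.
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (1, -1), (-1, 1), (-1, -1)]

def build_state(ranges_m, target_xy):
    """Flatten one lidar scan and the local target into a single input vector."""
    if len(ranges_m) != NUM_BEAMS:
        raise ValueError("expected one range reading per degree")
    state = []
    for degree, distance in enumerate(ranges_m):
        state.append(math.radians(degree))
        state.append(min(distance, MAX_RANGE_M))  # clip to the range limit
    for _ in range(TARGET_COPIES):
        state.extend(target_xy)
    return state

def one_hot(index, size=len(ACTIONS)):
    """Encode an action index as a one-hot list."""
    return [1 if i == index else 0 for i in range(size)]

state = build_state([MAX_RANGE_M] * NUM_BEAMS, (1.0, 0.5))
```

Repeating the target 40 times presumably keeps the two target coordinates from being drowned out by the 720 lidar values.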
p is the position after the move, g is the local target point, and o denotes an obstacle, including its inflation.
Since the length of a single laser point can be anywhere from 0 to 200 cm, the state space is far too large for a Q-table to be a sensible choice.
To ensure that deep reinforcement learning training converges normally, the pool should be large enough to store the state-action of each time step and to keep the training samples of the neural network independent and identically distributed.
Also, during training, do we sample once, then randomize, then sample again and look at the result? If so, training presumably cannot take place within a single path-planning run but has to work as described above, because neural network training requires it. (Is DDQN trained this way as well? Open question.)
Besides, the environment punishments and rewards should make up a certain proportion of the samples. If the sample space is too sparse, that is, if the main states are random movements in free space, it is difficult to achieve a stable training effect.
n is the current iteration, $i_{min}$ is the minimum value of L, $i_{max}$ is the maximum value of L, m is the speed of the space search, and N1 and N2 are hyperparameters that set thresholds on the number of iterations.
Each episode terminates after a fixed number of moving steps, rather than ending immediately when the agent encounters an obstacle or reaches the target point.
The original training mode fills the pool with a large number of samples that are moving states in the free zone, and the lack of trial-and-error punishment samples and target reward samples ultimately causes the algorithm to fail to converge.
With the original method, most of the data comes from moving through free space, so samples with a reward of -1 are very rare and the bulk of the samples are * moves, which prevents convergence.

Local Path Planning Simulation with Pygame Module

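The Pygame module is used to build the dynamic environment.
The paper then says that training starts with random exploration, and that each sample is [s, a, r, s1, d], where d indicates whether the current episode has ended.
The pool size is 40000.
Once the number of samples stored in the pool reaches a certain threshold, the network is trained with randomly selected samples from the pool.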
In the first 5000 time steps of random exploration, the network parameters are not updated, but samples are added to the pool. After the sample size reaches 5000, the network is trained once every four movement steps.
In short: no training (no parameter updates) at first; once the sample size reaches 5000, the network is trained once every four movement steps.
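The warm-up and update schedule described above can be sketched as follows. The constants come from these notes; the loop, the placeholder samples, and the helper `should_train` are assumptions made for illustration, and the actual network update is left out.

```python
from collections import deque

POOL_SIZE = 40000     # capacity of the sample pool
WARMUP_SAMPLES = 5000 # samples collected before any parameter update
TRAIN_EVERY = 4       # one update per four movement steps

pool = deque(maxlen=POOL_SIZE)

def should_train(step):
    """True on steps where a network update is due."""
    return len(pool) >= WARMUP_SAMPLES and step % TRAIN_EVERY == 0

updates = 0
for step in range(1, 6001):
    # Placeholder sample in the [s, a, r, s1, d] layout; a real agent would
    # store the observed state, action, reward, next state, and done flag.
    pool.append((step, 0, 0.0, step + 1, False))
    if should_train(step):
        updates += 1  # the network update itself would go here
```

With 6000 steps, updates happen on steps 5000, 5004, and so on up to 6000, while the first 4999 steps only fill the pool.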

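32 samples are drawn at random to update the Q estimation network, and the Q target network is trained with a low learning rate so that it approaches the Q estimation network. This keeps the learning of the Q target network stable. When the pool is full, the oldest samples are evicted in FIFO order.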
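The update step can be sketched in plain Python as below. This shows the standard Double DQN target together with a soft target update, which is one reading of "trained with a low learning rate"; the paper may do it differently. The values of `GAMMA` and `TAU` and all function names are assumptions, and the networks are replaced by plain lists of numbers.

```python
import random
from collections import deque

BATCH_SIZE = 32
GAMMA = 0.99   # discount factor (assumed value)
TAU = 0.001    # target-network update rate (assumed value)

def ddqn_target(reward, done, q_est_next, q_tgt_next, gamma=GAMMA):
    """Double DQN target for one sample.

    The estimation network chooses the next action and the target network
    scores it, which reduces the overestimation of plain DQN.
    """
    if done:
        return reward
    best = max(range(len(q_est_next)), key=q_est_next.__getitem__)
    return reward + gamma * q_tgt_next[best]

def soft_update(target_params, est_params, tau=TAU):
    """Move each target parameter a small step towards the estimation network."""
    return [(1 - tau) * t + tau * e for t, e in zip(target_params, est_params)]

# A bounded deque evicts its oldest entry once full, which gives FIFO behaviour.
pool = deque(maxlen=5)
for i in range(8):
    pool.append(i)

big_pool = deque(range(1000), maxlen=40000)
batch = random.sample(big_pool, BATCH_SIZE)  # uniform draw without replacement
```

Sampling uniformly from a large pool is what breaks the correlation between consecutive time steps.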

We store the network parameters and test them on an experimental environment map.
Figure 6 shows a new environment map that was never used in training.
There are three freely moving obstacles with constant speed; their moving directions are denoted by the blue arrows.
In the figure, state A shows the local path blocked by dynamic obstacle No. 2; the agent waits for the obstacle to move downwards, and once the obstacle is off the path, the agent moves towards the upper right. As shown in state B, the agent successfully reaches the third local target. In state C, obstacle No. 3 moves towards the agent. The agent perceives obstacle No. 3 before a collision and reaches the sixth local target point by moving towards the bottom right. In state D, the agent reaches the endpoint and completes the path planning without colliding with any obstacle.
The whole process demonstrates that, after being trained with DDQN, the agent is able to perceive moving obstacles in advance from lidar data in an unknown dynamic environment. The Q target network provides an intelligent planning method in which each step moves towards a higher cumulative reward.

