Learning to Drive in a Day with Reinforcement Learning
Do you remember learning to ride a bicycle as a child? Excited and mildly anxious, you probably sat on a bicycle for the first time and pedalled while an adult hovered over you, prepared to catch you if you lost balance. After some wobbly attempts, you perhaps managed to balance for a few metres. Several hours in, you were probably zipping around the park on gravel and grass alike.
The adult would have only given you brief tips along the way. You did not need a dense 3D map of the park nor a high fidelity laser on your head. You did not need a long list of rules to follow to be able to balance on the bicycle. The adult simply gave you a safe environment for you to learn how to map what you see to what you should do, to successfully ride a bicycle.
Today’s self-driving cars have been packed with a large array of sensors, and are told how to drive with a long list of carefully hand-engineered rules through slow development cycles. In this blogpost, we go back to basics, and let a car learn to follow a lane from scratch, with clever trial and error, much like how you learnt to ride a bicycle. Have a look at what we did:
In just 15–20 minutes, we were able to teach a car to follow a lane from scratch, using only the moments when the safety driver took over as training feedback.
No dense 3D map.
No hand-written rules.
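To make that training signal concrete, below is a minimal sketch of an episode loop where the only supervision is the moment the safety driver takes over. The `vehicle` and `policy` interfaces, and the distance-based reward, are illustrative assumptions, not the exact implementation described in our technical report.

```python
# Minimal sketch of takeover-based feedback. `vehicle` and `policy` are
# hypothetical stand-ins used only to illustrate the idea.

def run_episode(vehicle, policy, dt=0.1):
    """Drive until the safety driver takes over; return the collected transitions."""
    transitions = []
    obs = vehicle.get_camera_image()           # single monocular image
    done = False
    while not done:
        action = policy.act(obs)               # e.g. steering and speed command
        vehicle.apply(action)
        next_obs = vehicle.get_camera_image()
        done = vehicle.safety_driver_intervened()
        # Reward progress made without a takeover; an intervention simply ends
        # the episode, so takeovers are the only supervision required.
        reward = 0.0 if done else vehicle.speed() * dt
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return transitions
```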
This is the first example where an autonomous car has learnt online, getting better with every trial. So, how did we do it?
We adapted a popular model-free deep reinforcement learning algorithm (deep deterministic policy gradients, DDPG) to solve the lane following task. Our model input was a single monocular camera image. Our system iterated through 3 processes: exploration, optimisation and evaluation.
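A rough sketch of that cycle is shown below: exploration collects experience until a takeover, optimisation runs DDPG updates on a replay buffer, and evaluation drives the deterministic policy without exploration noise. The `env` and `agent` interfaces are assumptions for illustration, not the actual system.

```python
# Illustrative sketch of the exploration / optimisation / evaluation cycle.
# `env` stands in for the car or simulator, `agent` for a DDPG implementation
# (actor, critic and replay buffer); both are hypothetical interfaces.

def training_cycle(env, agent, iterations=20, updates_per_iteration=100):
    for it in range(iterations):
        # 1. Exploration: drive with exploration noise and store transitions.
        obs, done = env.reset(), False
        while not done:
            action = agent.act(obs, explore=True)        # noisy policy output
            next_obs, reward, done = env.step(action)
            agent.replay_buffer.add(obs, action, reward, next_obs, done)
            obs = next_obs

        # 2. Optimisation: DDPG actor/critic gradient steps on the buffer.
        for _ in range(updates_per_iteration):
            agent.update()

        # 3. Evaluation: drive the deterministic policy, no exploration noise.
        obs, done, metres = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(agent.act(obs, explore=False))
            metres += reward                             # reward is distance in this sketch
        print(f"iteration {it}: drove {metres:.1f} m before intervention")
```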
Our network architecture was a deep network with 4 convolutional layers and 3 fully connected layers, with a total of just under 10k parameters. For comparison, state-of-the-art image classification architectures have tens of millions of parameters.
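For a concrete sense of that scale, here is a hedged PyTorch-style sketch of a network in the same spirit: 4 convolutional layers feeding 3 fully connected layers. The input resolution, channel counts and the 2 outputs are assumptions chosen only to land just under the 10k-parameter budget; the actual architecture is described in the technical report.

```python
import torch
import torch.nn as nn

class LaneFollowingPolicy(nn.Module):
    """Hedged sketch: 4 conv + 3 fully connected layers, roughly 9.6k parameters.
    All sizes (64x64 RGB input, channel counts, 2 outputs for steering and
    speed) are illustrative assumptions, not the published architecture."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),    # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(8, 12, kernel_size=3, stride=2, padding=1),   # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(12, 16, kernel_size=3, stride=2, padding=1),  # 16x16 -> 8x8
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),  # 8x8 -> 4x4
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Linear(16, 2),    # e.g. steering angle and target speed
        )

    def forward(self, image):
        return self.head(self.encoder(image))

model = LaneFollowingPolicy()
print(sum(p.numel() for p in model.parameters()))  # roughly 9.6k parameters
```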
All processing was performed on one graphics processing unit (GPU) on-board the car.
Working on a real robot in a dangerous real environment poses many new problems. In order to better understand the task at hand and find suitable model architectures and hyperparameters, we did a lot of testing in simulation.
Above is an example of our lane-following simulated environment shown from different angles. The algorithm only sees the driver perspective, i.e. the image with the teal border. At every episode, we randomly generate a curved lane to follow, as well as the road texture and lane markings. The agent explores until it leaves the lane, at which point the episode terminates. Then the policy optimises based on the collected data, and we repeat.
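As a hedged sketch of how such a procedurally generated environment might look, the toy class below draws a new random curved lane on every reset and terminates the episode once the lateral offset exceeds the lane half-width. The dynamics, observation and all constants are drastically simplified assumptions, purely for illustration.

```python
import numpy as np

class RandomLaneEnv:
    """Illustrative stand-in for the simulator: every reset() draws a new
    random curved lane; the episode ends once the agent leaves the lane."""

    def __init__(self, lane_half_width=1.5, dt=0.1):
        self.lane_half_width = lane_half_width
        self.dt = dt

    def reset(self):
        # New random curvature for every episode (the real simulator also
        # randomises road texture and lane markings).
        self.curvature = np.random.uniform(-0.05, 0.05)
        self.lateral_offset = 0.0
        self.heading_error = 0.0
        return self._observe()

    def step(self, steering, speed=5.0):
        # Toy kinematics: steering corrects the heading, curvature perturbs it.
        self.heading_error += (steering - self.curvature * speed) * self.dt
        self.lateral_offset += np.sin(self.heading_error) * speed * self.dt
        done = abs(self.lateral_offset) > self.lane_half_width
        reward = 0.0 if done else speed * self.dt   # distance kept in lane
        return self._observe(), reward, done

    def _observe(self):
        # The real environment returns the driver-view camera image; here we
        # return the underlying state as a placeholder observation.
        return np.array([self.lateral_offset, self.heading_error])
```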
We used simulated tests to try out different neural network architectures and hyperparameters until we found settings which consistently solved the lane-following task in very few training episodes, i.e. with little data. For example, one of our findings was that training the convolutional layers using an auto-encoder reconstruction loss significantly improved the stability and data-efficiency of training. See our full technical report for more details.
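A hedged sketch of that idea: alongside the reinforcement learning losses, the convolutional encoder also feeds a small decoder, and an image reconstruction term is added so the early layers receive a dense training signal from every image. The decoder shape and loss weighting below are assumptions for illustration, building on the hypothetical LaneFollowingPolicy sketch above.

```python
import torch.nn as nn
import torch.nn.functional as F

# Assumes the LaneFollowingPolicy sketch above; the decoder mirrors its
# 4-layer encoder so the conv features can reconstruct the input image.
decoder = nn.Sequential(
    nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2, padding=1, output_padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(16, 12, kernel_size=3, stride=2, padding=1, output_padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(12, 8, kernel_size=3, stride=2, padding=1, output_padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(8, 3, kernel_size=3, stride=2, padding=1, output_padding=1),
)

def total_loss(policy, decoder, images, rl_loss, recon_weight=1.0):
    """Add an auto-encoder reconstruction term to the usual RL loss so the
    convolutional layers keep learning even from very few episodes."""
    features = policy.encoder(images)       # shared convolutional features
    reconstruction = decoder(features)      # back to 3 x 64 x 64
    recon_loss = F.mse_loss(reconstruction, images)
    return rl_loss + recon_weight * recon_loss
```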
The potential implications of our approach are huge.
Imagine deploying a fleet of autonomous cars with a driving algorithm which is initially 95% of the quality of a human driver. Such a system would not be wobbly like the randomly initialised model in our demonstration video, but would be almost capable of dealing with traffic lights, roundabouts, intersections, etc. After a full day of driving and online improvement from safety-driver takeovers, perhaps the system would improve to 96%. After a week, 98%. After a month, 99%. After a few months, the system may be super-human, having benefited from the feedback of many different safety drivers.
Today’s self-driving cars are stuck at good, but not good enough, performance levels. Here, we have provided evidence for the first viable framework for quickly improving driving algorithms from mediocre to roadworthy. The ability to quickly learn to solve tasks through clever trial and error is what has made humans incredibly versatile machines capable of evolution and survival. We learn through a mixture of imitation and lots of trial and error, for everything from riding a bicycle to learning how to cook.
DeepMind have shown us that deep reinforcement learning methods can lead to super-human performance in many games, including Go, chess and computer games, almost always outperforming any rule-based system. We show here that a similar philosophy is also possible in the real world, and in particular, in autonomous vehicles. A crucial point to note is that DeepMind’s Atari-playing algorithms required millions of trials to solve a task. It is remarkable that we consistently learnt to lane-follow in under 20 trials.
We learnt to follow lanes from scratch in 20 minutes. Imagine what we could learn to do in a day…?
Wayve has a philosophy that to build robotic intelligence we do not need massive models, fancy sensors and endless data. What we need is a clever training process that learns rapidly and efficiently, like in our video above. Hand-engineered approaches to the self-driving problem have reached an unsatisfactory glass ceiling in performance. Wayve is attempting to unlock autonomous driving capabilities with smarter machine learning.
We’re hiring! wayve.ai/careers/
Full research paper: arXiv paper link, published at the International Conference on Robotics and Automation 2019.
Special thanks: We would like to thank StreetDrone for building us an awesome robotic vehicle, Admiral for insuring our vehicle trials and the Cambridge Polo Club for granting us access to their private land for our lane-following research.
This story was originally published at https://wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning on 28th June 2018.