Learning to Drive with Unity ML-Agents – A Beginner’s Guide to Deep RL for Autonomous Vehicles


This article is the first in a series devoted to the topic of designing an agent capable of learning to do a task well, with a particular emphasis on reinforcement learning (RL) for autonomous vehicles. We find that the task of driving is quite challenging, providing a rich case study to consider when discussing the problem of good agent design. For this reason, we hope that this discussion of an agent’s action, observation and reward spaces and how their design affects its ability to learn will be useful, not only to those studying reinforcement learning applied to driving, but to researchers and practitioners interested in agent and task design generally.

If you’d like to follow along at home, the Unity environments and agent implementations that accompany this article are available as a GitHub repository.

Reinforcement Learning for Autonomous Vehicles.

One quick note: While this is written as a beginner’s guide, we assume that the reader already has at least some prior understanding of RL, so we won’t define the most basic terms, such as what a reward function is or how a policy is computed. If you’d like to brush up on these topics, a great place to start is David Silver’s excellent Introduction to Reinforcement Learning. OpenAI’s Spinning Up in Deep RL is also a fantastic introduction that provides an approachable mix of theory and example code.

Much of the literature in reinforcement learning is devoted to the development of ever more general, efficient learning algorithms, typically focusing on performance against a common set of benchmark tasks like learning to walk or to play arcade games. Less frequently discussed are the conditions necessary to learn a good policy in general. For example, how can we know if a given set of observation inputs is sufficient to accomplish a particular task? We will explore that and other questions in this series. As of this writing, state-of-the-art RL algorithms still often struggle or fail completely to learn in an environment with sparse rewards, and similar issues arise when the observation and action spaces of the agent are not carefully considered. When attempting to apply existing learning algorithms to a new problem, it is exactly these issues that are likely to concern us most.

Task Description

Driving in the real world is multifaceted and complex. While driving, it will typically be necessary to perform a number of different maneuvers, often simultaneously. Some of these include the following:

  • Lane following
  • Lane changing
  • Handling road intersections of various kinds
  • Maintaining speeds that are neither too fast nor too slow
  • Avoiding collision with other vehicles, pedestrians, obstacles, etc.
  • Responding to traffic control devices

This list is by no means exhaustive, but essentially, the goal of driving is to safely navigate from point A to point B, so this is the first task we must consider.

Learning Environment

Although the ultimate goal is to train agents that can navigate autonomously in the real world, we obviously want to develop our approach safely, and using a simulation environment is crucial to achieve this. There are now several driving simulators that provide nice features like sophisticated sensors, realistic weather conditions and more. But with those features comes a certain irreducible amount of complexity.

In our experience, the pitfalls in reinforcement learning are often the result of poorly thought out assumptions rather than bugs in code or imperfectly tuned hyperparameters. When starting a new problem, it is important to strip away all but the most essential elements. Unless we are able to learn effectively at the most basic level, we will have little chance of developing an agent that can control a much more complex system. For now, we will focus on Unity and its ML-Agents Toolkit to develop our approach. Unity provides a powerful platform to develop environments and agents that can be as simple or as complex as we would like to make them. The ML-Agents framework also makes it easy to train many agents in parallel, effectively speeding up the cycle of hypothesis, experiment and measurement.

Starting with a Baseline

Our initial implementation will essentially match the ML-Agents tutorial for Creating a New Learning Environment. This tutorial walks you through creating a scene with a simple RollerAgent whose goal is to reach a target. If you haven’t built a learning environment in Unity before, we highly recommend taking the time to follow this tutorial, as it gives a great introduction to the mechanics of ML-Agents. Conveniently, it also defines the task we want to achieve, albeit for an agent that has a very different action space.

The RollerAgent in the tutorial is a controller for a simple sphere. The action space consists of 2 continuous control signals, which are applied to the sphere’s rigid body as physical forces in the x and z axes. Our DrivingAgent will mimic a car, complete with non-holonomic steering. While it is unlikely that substituting this DrivingAgent for the RollerAgent will just work out of the box, using this simple setup as a baseline gives us a concrete starting point that should, under the right conditions, work.

Agent (red) and Target (cyan)

Agent and Controller

In a real self-driving car, the decision making layer is only one of many modules that need to interact with each other. While we won’t be simulating a realistic vehicle in this article, we can at least approximate that modularity by separating the decision making Agent from the lower level Vehicle Controller. This also allows us to reuse the controller code as we iterate on our agent’s design.

The simplest implementation of a car in Unity requires a simple Cube primitive for a chassis and four Wheel Colliders. This tutorial by Unity explains how to build a very simple driving controller that roughly approximates the way a real car drives. We’ve made some modifications to it so that the controller is callable by our Agent, and so we don’t have to tune the various parameters for each individual Wheel Collider.

The Vehicle Controller (located here) has 3 settable properties: Steering, Throttle, and Brake. These correspond to the Agent’s action space, which is made up of 3 continuous values. The Steering is clipped to values between -1 and +1, corresponding to steering angles from extreme left to extreme right, and scaled by a maxSteeringAngle property that we can adjust in the Unity editor. The Throttle is also clipped to values between -1 and +1, corresponding to throttle values between maximum reverse and maximum forward throttle, and scaled by the maxMotorTorque property. Finally, the Brake is clipped to values between 0 and +1 and scaled by the maxBrakeTorque property.
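To make the clipping and scaling concrete, here is a minimal, engine-independent sketch in Python. It is not the repository’s controller: the function name scale_actions, the VehicleCommand container and the three default maximum values are assumptions chosen for illustration, standing in for the maxSteeringAngle, maxMotorTorque and maxBrakeTorque properties set in the Unity editor.

```python
from dataclasses import dataclass


def clamp(value: float, low: float, high: float) -> float:
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class VehicleCommand:
    steering_angle: float  # negative = left, positive = right
    motor_torque: float    # negative = reverse, positive = forward
    brake_torque: float    # never negative


def scale_actions(steering: float, throttle: float, brake: float,
                  max_steering_angle: float = 30.0,  # assumed value
                  max_motor_torque: float = 400.0,   # assumed value
                  max_brake_torque: float = 800.0    # assumed value
                  ) -> VehicleCommand:
    """Clip the three raw actions, then scale them to physical units."""
    return VehicleCommand(
        steering_angle=clamp(steering, -1.0, 1.0) * max_steering_angle,
        motor_torque=clamp(throttle, -1.0, 1.0) * max_motor_torque,
        brake_torque=clamp(brake, 0.0, 1.0) * max_brake_torque,
    )


# An out-of-range steering action saturates at the maximum angle,
# and a negative brake action is clipped to zero.
command = scale_actions(steering=2.0, throttle=-0.5, brake=-1.0)
```

Clipping before scaling means that whatever the policy outputs, the controller never receives a command outside the limits configured in the editor.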

As an aside, it is worth mentioning that representing the reverse gear as a negative throttle action from the Agent would probably not be a good idea in the real world. We would never want to allow the Agent the ability to oscillate rapidly between forward and reverse throttles, which it may choose to do in the current configuration in order to moderate its speed. A better approach might be to create a state machine that ensures that the agent can only transition from a forward driving state to a reverse state by first coming to a full stop. That way the throttle could always be a value between 0 and +1. However, achieving this level of sophistication is beyond the scope of the current article.
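For the curious, the state machine idea could look roughly like the following Python sketch. It is not part of the accompanying project; the class name GearStateMachine, the Gear enum and the stop_speed threshold are all hypothetical, and a real vehicle would need considerably more care.

```python
from enum import Enum


class Gear(Enum):
    FORWARD = 1
    REVERSE = -1


class GearStateMachine:
    """Allows a gear change only while the vehicle is (nearly) stopped."""

    def __init__(self, stop_speed: float = 0.05):
        self.gear = Gear.FORWARD
        self.stop_speed = stop_speed  # assumed threshold for a "full stop"

    def request_gear(self, gear: Gear, speed: float) -> bool:
        """Switch gear if stopped; return True if the requested gear is active."""
        if gear != self.gear and abs(speed) <= self.stop_speed:
            self.gear = gear
        return self.gear == gear

    def signed_throttle(self, throttle: float) -> float:
        """Map a throttle in [0, 1] to a signed value using the current gear."""
        throttle = max(0.0, min(1.0, throttle))
        return self.gear.value * throttle


fsm = GearStateMachine()
fsm.request_gear(Gear.REVERSE, speed=5.0)  # refused: the car is still moving
fsm.request_gear(Gear.REVERSE, speed=0.0)  # accepted: the car has stopped
```

With this arrangement the agent’s throttle action stays in the range 0 to +1, and the sign is decided by the gear state rather than by the policy at every step.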

ReachTargetAgent Class Diagram

The Agent itself is implemented here. Note that this base class is missing the CollectObservations method that was described in the ML-Agents tutorial. This is by design. For each of the experiments below, we will create a subclass of this basic ReachTargetAgent, and the only changes we will need to make will be to the observations and/or the rewards, both of which we can handle in the CollectObservations method, described in detail below.

When the Agent reaches the Target, it receives a reward of +1 and the Target is respawned to a new random location. When the Agent falls off the edge of the platform, the episode is reset. In this implementation, we do not reset the agent position when it successfully reaches the goal. This allows the agent to gradually learn to navigate from many initial conditions, rather than always restarting at zero velocity. If all goes well, this should encourage the Agent to get very good at efficiently navigating to the goal and then quickly reorienting as the goal is moved.

In order to sanity check that our Agent controls make sense, ML-Agents provides a Heuristic method, which enables us to map keyboard or joystick commands to the Agent’s action outputs. In our Unity project, we can test this by going to the Assets > ReachTarget > Scenes folder in the Project section and double clicking on ReachTarget_0 to open up the first scene. If everything is set up correctly, you should be able to press the “Play” button and drive the Agent around using the arrow keys of the keyboard for steering and throttle and the space bar for the brake.

Collecting Observations

The CollectObservations method will be our focus for the remainder of this article. The initial implementation (ReachTargetAgent_0) defines an observation space consisting of the Target position, the Agent position and the Agent velocity, which essentially matches the RollerAgent tutorial mentioned above. Each of the three is a real-valued vector in x, y and z, for a total of 9 observed values.

public override void CollectObservations(VectorSensor sensor)
{
   // Target and agent positions in world space
   sensor.AddObservation(target.position); // vec 3
   sensor.AddObservation(this.transform.position); // vec 3

   // Agent velocity in world space
   sensor.AddObservation(AgentVelocity); // vec 3
}

Training the Agent

For all of our experiments in this article, we will be training with the ML-Agents implementation of the Proximal Policy Optimization algorithm. This file defines our hyperparameter configurations. Note that we will leave the hyperparameters fixed for all of the experiments run in this article. However, we have defined a separate section for each experiment so that the saved models are easy to identify. You can learn more about what each of the configuration values means in this doc.

At the end of the above referenced tutorial, there is a section describing how to train multiple agents simultaneously. This is an awesome feature for getting results quickly. In our experiments, we use a simple script to replicate instances of the learning scene, called Prefabs in Unity, so that we don’t have to manage them manually. You can find that here.

Now that we have a working DrivingAgent and have defined an observation space and reward, let’s run the experiment forward and see some results!

  • First, run this on a command line, from within the project directory:
    mlagents-learn config/trainer_config.yaml --run-id=reach-target-0 --train
  • Then press Play in the Unity editor.

The agent should start moving around, exploring its environment, and (hopefully) it will learn something useful over time. You can monitor its progress on the command line, and/or using TensorBoard (tensorboard --logdir=summaries/). The following GIF shows the behavior of the model after training for 2 million steps:

Figure 1. ReachTargetAgent_0 trained for 2 million steps

Recall that the reward for reaching the target is +1. The figure above graphs the cumulative reward per episode, averaged over every 30,000 steps, which is how often we record the summary statistics. Over the 2 million steps, the average reward per episode never exceeds 0.025 and does not noticeably improve. Note that the episode length shown above is the number of time steps divided by the decision interval, since the Agent makes a decision once every 10 time steps. This means the Agent’s average episode length is typically between 1800 and 2000 steps, indicating that the Agent usually wanders around and only very infrequently, probably at random, reaches the Target or falls off the platform. That is exactly what we see in the GIF above. All of this is in contrast to the RollerAgent, which reliably reaches the goal nearly 100% of the time after only about 20,000 to 30,000 time steps. Why might this be?

A Sense of Direction

As mentioned before, perhaps the most obvious difference between our DrivingAgent and the RollerAgent is its configuration. Our agent is constrained in the directions it can travel. Only two of its wheels can steer and the steering angle is limited. If the target is to the left or right, the Agent cannot move directly toward it without performing a series of steering maneuvers. Since the reward is sparse and the chances of reaching the target by acting randomly are small, the Agent has seemingly little opportunity to learn an effective driving policy, even after millions of time steps. In order to induce the behavior we want to accomplish, perhaps it would help to give the agent a hint that it is moving in the right direction. This takes us straight into reward shaping territory.

The idea behind reward shaping is to provide a supplementary reward to help guide the agent towards the desired solution. In the case of our DrivingAgent, one way to shape the reward would be to compute the direction from the agent’s current position to the target. If we then project the agent’s velocity vector onto this direction vector (by computing their dot product), we get a scalar value reflecting the extent to which the agent is moving toward or away from the target. We can provide this to the agent in the form of a small reward at every time step, like so:

// The direction from the agent to the target
Vector3 dirToTarget = (target.position - this.transform.position).normalized;

// The alignment of the agent's velocity with this direction
float velocityAlignment = Vector3.Dot(dirToTarget, AgentVelocity);

// Small reward for moving in the direction of the target
AddReward(RewardScalar * velocityAlignment);

Instead of receiving only a very occasional reward upon reaching the goal, the agent now gets a richer signal from the velocityAlignment term to evaluate when computing the value of taking an action in any given state. However, we don’t want to overwhelm the primary reward for achieving our goal, so we scale the velocityAlignment by a RewardScalar to make it very small, in this case 1 / 2000 (the reciprocal of the maximum number of time steps in an episode).
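To get a feel for the magnitudes involved, the same computation can be reproduced outside Unity. The following Python sketch mirrors the shaping term using plain tuples; the function name velocity_alignment_reward and the example positions and velocities are assumptions chosen for illustration.

```python
import math

MAX_STEPS = 2000                 # maximum time steps per episode
REWARD_SCALAR = 1.0 / MAX_STEPS  # keeps the shaping term small


def velocity_alignment_reward(agent_pos, target_pos, velocity):
    """Scaled projection of the velocity onto the unit direction to the target."""
    offset = [t - a for a, t in zip(agent_pos, target_pos)]
    distance = math.sqrt(sum(c * c for c in offset))
    if distance == 0.0:
        return 0.0  # already at the target, so the direction is undefined
    direction = [c / distance for c in offset]
    alignment = sum(d * v for d, v in zip(direction, velocity))
    return REWARD_SCALAR * alignment


agent = (0.0, 0.0, 0.0)
target = (0.0, 0.0, 10.0)
toward = velocity_alignment_reward(agent, target, (0.0, 0.0, 4.0))    # positive
away = velocity_alignment_reward(agent, target, (0.0, 0.0, -4.0))     # negative
sideways = velocity_alignment_reward(agent, target, (4.0, 0.0, 0.0))  # zero
```

Moving straight toward the target earns a small positive reward, moving away earns an equally small penalty, and moving perpendicular to the target direction earns nothing. The per-step magnitude is never larger than the agent’s speed divided by 2000.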

Having made this small change, let’s run the experiment forward again to see if our updated Agent is any better at completing the task.

Figure 2. ReachTargetAgent_1 (gray) gets an additional velocity alignment reward every step

The gray line in Figure 2 shows the effect of adding this additional reward. There is now a clear trend of improvement, which is a good sign that a better-than-random policy is starting to take shape. However, the progress is slow, and upon observing the agent’s behavior, the results definitely seem suboptimal. Can we do better?

Expanding the Observation

When designing agents that operate in the real world, we tend to think in terms of what information can be obtained through sensors. That information is necessarily partially observable at best. Sensors have limited range and direction, are subject to error, and are often occluded by various objects in their sensing range. Also, the more information that is given per time step, the more computation is required to make sense of it, and the longer it takes to learn. For these reasons, we tend to start from the bottom up, trying to add only the information that is necessary and sufficient for the agent to determine the next best action to take.

So, what is the minimum information that is both necessary and sufficient? Essentially, an observation needs to have the infamous Markov property. We can say that an observation is Markovian if any two sets of events that lead to that observation have an equal likelihood of resulting in the same next observation, given the same next event. Put differently, a Markovian observation can be said to be “memoryless”, because it compactly summarizes all the information needed to predict the likely outcome of the next event. This handy property is what our DrivingAgent needs to exploit when computing the value of taking an action in a given situation.

Considered from this perspective, it is not enough to know the DrivingAgent’s velocity and position relative to the target. When the agent in question was a sphere, this was not a problem, because at any given moment, a force acting on the sphere could move it in the desired direction with no constraints. However, two of our DrivingAgents in exactly the same position, with exactly the same velocity but with different orientations, will have different outcomes when throttle is applied, because of the constraints placed on them by their wheels. We can now clearly see that including the DrivingAgent’s orientation in the observation is necessary for it to decide what the next best action should be. If we enable our agent to observe its current heading as well as the direction to the target, it might be able to learn the series of actions needed to reorient itself and move in the right direction.
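The point about orientation can be checked with a toy model. The Python sketch below is not the Unity physics: the step function, its dt and accel defaults and the convention that a heading of 0 degrees faces +z are all assumptions. It only illustrates that identical position and velocity observations can lead to different next states.

```python
import math


def step(position, velocity, heading_deg, throttle, dt=0.1, accel=5.0):
    """Advance a toy car one step: throttle accelerates along the heading only."""
    heading = math.radians(heading_deg)
    forward = (math.sin(heading), math.cos(heading))  # (x, z); 0 deg faces +z
    new_velocity = (velocity[0] + forward[0] * throttle * accel * dt,
                    velocity[1] + forward[1] * throttle * accel * dt)
    new_position = (position[0] + new_velocity[0] * dt,
                    position[1] + new_velocity[1] * dt)
    return new_position, new_velocity


# Two agents share the same position and velocity but face different ways.
obs_position, obs_velocity = (0.0, 0.0), (0.0, 0.0)
facing_north = step(obs_position, obs_velocity, heading_deg=0.0, throttle=1.0)
facing_east = step(obs_position, obs_velocity, heading_deg=90.0, throttle=1.0)
```

Given the same action, the first agent moves along +z and the second along +x. An observation that omits the heading cannot distinguish the two cases, so it cannot predict the outcome of the action.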

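// Direction to target and agent heading
sensor.AddObservation(dirToTarget); // vec 3
sensor.AddObservation(this.transform.forward); // vec 3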

Here we have added the direction from the Agent to the Target as well as the vehicle’s forward direction to the observation. This should be enough for our Agent to learn a relationship between its current heading and the direction it should be traveling. Another potentially useful observation that we did not add is the Agent’s steering angle. Because the normalized steering angle should be exactly the steering action given by this toy DrivingAgent, we don’t think it’s necessary to add here. However, in a more realistic vehicle, we would almost certainly want to observe the current steering angle of both wheels. Let’s run another training session to see how these changes affect things.

Figure 3. ReachTargetAgent_2 (red) is able to observe the direction to target and its own heading

Now we see that things are really starting to improve. While the average reward, indicated by the red line in Figure 3, is only roughly 0.25 to 0.3 after 2 million time steps, the trend is definitely going up, and it appears likely that continuing to train would yield more improvements. However, we might be able to do better still.

A Change in Perspective

After seeing the difference these changes to the observation space have made, we start thinking a little more carefully about what exactly it is the Agent “sees”. There are currently 15 numerical values that the agent observes:

  • 3 values for the Target position
  • 3 values for the Agent’s current position
  • 3 values for the Agent’s velocity
  • 3 values for the direction from the Agent to the Target
  • 3 values for the direction the Agent is facing

At the outset, the Agent has no way to know what these numbers represent nor what relationship any number has to another. The Agent simply takes random actions and observes what changes. Values 1 through 3 occasionally and randomly change as the episodes reset. However, sometimes, as the Euclidean distance between values 1 through 3 and values 4 through 6 becomes less than 1, a magical thing happens: the first 3 values suddenly change and a relatively large reward of +1 is received.

It feels like a lot to ask the Agent to learn an abstract concept like Euclidean distance. It is almost magical that it works at all, but what if we could simplify the problem? Instead of considering things from a global point of view, where the numbers have little meaning beyond their relationship to each other and that relationship must be learned, what if we change the reference frame, so that the Agent is always at the origin?

// Target position in agent frame
sensor.AddObservation(
   this.transform.InverseTransformPoint(target.transform.position)); // vec 3

// Agent velocity in agent frame
sensor.AddObservation(
   this.transform.InverseTransformVector(AgentVelocity)); // vec 3

// Direction to target in agent frame
sensor.AddObservation(
   this.transform.InverseTransformDirection(dirToTarget)); // vec 3

With this set of changes, we are able to eliminate the Agent’s pose from the observation and only consider the other elements of the observation from the Agent’s point of view. Not only does this mean there are fewer values to observe, but now every action has a clear and easily interpretable effect on the Agent’s distance from and direction to the Target. With this new ego-centric perspective, the Agent becomes less like a passive observer looking at the data from a distance, trying to figure out how the numbers relate to each other. Now each change in the observation can be attributed to some action. Let us once again run this experiment and see how these changes affect our Agent’s ability to learn.
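For readers who want to see numerically what the change of reference frame does, here is a simplified Python sketch. It is an approximation of the idea rather than Unity’s implementation: the function world_to_agent_frame is hypothetical, it works in the x-z plane only, and it assumes the agent rotates about the vertical axis with unit scale.

```python
import math


def world_to_agent_frame(point, agent_pos, agent_yaw_deg):
    """Express a world-space (x, z) point in the agent's local frame.

    Local +z is the agent's forward direction and local +x is its right.
    """
    yaw = math.radians(agent_yaw_deg)
    dx, dz = point[0] - agent_pos[0], point[1] - agent_pos[1]
    # Undo the agent's rotation by projecting the offset onto its axes.
    local_x = dx * math.cos(yaw) - dz * math.sin(yaw)
    local_z = dx * math.sin(yaw) + dz * math.cos(yaw)
    return (local_x, local_z)


# Scene A: agent at the origin facing +z, target 5 units straight ahead.
scene_a = world_to_agent_frame((0.0, 5.0), agent_pos=(0.0, 0.0), agent_yaw_deg=0.0)

# Scene B: agent elsewhere, facing +x, target again 5 units straight ahead.
scene_b = world_to_agent_frame((15.0, 3.0), agent_pos=(10.0, 3.0), agent_yaw_deg=90.0)
```

The two scenes look completely different in world coordinates, yet both produce (approximately) the same local observation of (0, 5): the target is five units straight ahead. The policy therefore only has to learn one response to that situation instead of one per world pose.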

Figure 4. ReachTargetAgent_3 (blue) transforms observations to an ego-centric frame of reference

Figure 4 shows a dramatic improvement! This change in perspective has drastically reduced the time needed to learn. In the GIF, we can now see that once the Agent starts to reliably reach the Target, it quickly becomes adept at changing course and navigating to the next Target as it is randomly repositioned. Eureka!

Revisiting the Original Goal

With such a stark difference, it seems clear that the change in observation was a very important factor, perhaps the most important one, in our Agent’s ability to learn a good policy. This raises the question: what if we removed the auxiliary velocityAlignment reward and retrained the Agent with this improved observation space and a simple reward of +1 for reaching the Target? Let’s do that and see what happens!

Figure 5. ReachTargetAgent_4 (magenta) only gets the original sparse reward

The magenta line in Figure 5 shows that the Agent is now quite capable of solving the task, even with a sparse reward. It does appear to take longer for the Agent to start gaining proficiency without the velocityAlignment reward, as expected. But the difference between the original ReachTargetAgent_0 implementation and this updated version with a transformed observation space couldn’t be more stark. At this point, it would probably make sense to start refining the agent design, experimenting with hyperparameters, and so on, but that is beyond the scope of this article. The final version of the CollectObservations method is available for you to look at here (ReachTargetAgent_4).


We have shown that a careful consideration of the environment and agent’s design can make the difference between mastering a task and learning a policy that is not much better than random. We made no change to the learning algorithm, learning rate, number of layers in the neural network or any other hyperparameter.

We hope you enjoyed this first step towards creating a learned driving agent. While there is still a lot to do before the agent we developed in this article can competently drive in any sort of realistic scenario, this was an important first step in the journey. In future installments of this series, we’ll work on making our agent more capable by introducing lane following and changing, handling intersections, responding to traffic control devices, negotiating with other vehicles and pedestrians, and more! Follow this series if you want to develop your skills in reinforcement learning for self-driving cars.

Thank you for reading!
