Deep reinforcement learning algorithms can be hard to debug, so it helps to visualize as much as possible in the absence of a stack trace [1]. How do we know if the learned policy and value functions make sense? Seeing these quantities plotted in real time as an agent is interacting with an environment can help us answer that question.
Here’s an example of an agent wandering around a custom gridworld environment. When the agent executes the `toggle` action in front of an unopened red gift, it receives a reward of 1 point, and the gift turns grey/inactive.
The model is an actor-critic, a type of policy gradient algorithm (for a nice introduction, see Jonathan’s battleship post or [2]) that uses a neural network to parametrize its policy and value functions.
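The post doesn’t say which framework the model uses, so here is only a rough sketch, assuming PyTorch, of an actor-critic network that parametrizes both functions with a shared trunk:

```python
import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    """Shared trunk with a policy (actor) head and a value (critic) head."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # logits over actions
        self.value_head = nn.Linear(hidden, 1)           # scalar state value

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        policy = torch.softmax(self.policy_head(h), dim=-1)  # action probabilities
        value = self.value_head(h).squeeze(-1)               # V(s)
        return policy, value
```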
This agent barely “meets expectations” (notably, it gets stuck at an opened gift between frames 5 and 35), but the values and policy mostly make sense. For example, we tend to see spikes in value when the agent is immediately in front of an unopened gift, while the policy simultaneously puts a much higher probability on taking the appropriate `toggle` action. (We’d achieve better performance by incorporating some memory into the model, in the form of an LSTM.)
We’re sharing a little helper code to generate the matplotlib plots of the value and policy functions that are shown in the video.
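The full script isn’t reproduced in this excerpt, but a minimal sketch of its overall shape might look like the following. The `Agent` class, module name, environment id, and checkpoint path are placeholders; the `get_action` interface and the use of `env.render('human')`, `relim`, and `autoscale_view` follow the assumptions spelled out in the comments below. Only two of the plots (value over time and policy probabilities) are sketched here.

```python
import gym
import numpy as np
import matplotlib.pyplot as plt

# Placeholders: the trained agent and the custom gridworld id are not part of
# this post and will differ in your setup.
from my_agent import Agent

env = gym.make("MiniGrid-CustomGifts-v0")  # hypothetical env id (already registered)
agent = Agent.load("actor_critic.pt")      # hypothetical loader and checkpoint path

plt.ion()  # interactive mode so the figures update without blocking

# Figure 1: value function over time (sliding window via relim/autoscale_view).
value_fig, value_ax = plt.subplots()
value_line, = value_ax.plot([], [])
value_ax.set_xlabel("step")
value_ax.set_ylabel("value")

# Figure 2: policy probabilities for the current observation.
policy_fig, policy_ax = plt.subplots()
n_actions = env.action_space.n
policy_bars = policy_ax.bar(np.arange(n_actions), np.zeros(n_actions))
policy_ax.set_ylim(0, 1)
policy_ax.set_xlabel("action")
policy_ax.set_ylabel("probability")

steps, values = [], []
obs = env.reset()
for t in range(500):
    env.render('human')  # built-in interactive rendering of the gridworld

    # get_action returns the action to take, a numpy array of policy
    # probabilities, and a scalar value for the current observation.
    action, policy, value = agent.get_action(obs)

    steps.append(t)
    values.append(value)
    value_line.set_data(steps[-100:], values[-100:])  # keep a sliding window
    value_ax.relim()
    value_ax.autoscale_view()

    for bar, p in zip(policy_bars, policy):
        bar.set_height(p)

    plt.pause(0.001)  # let matplotlib redraw the figures

    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```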
Comments
- Training of the model is not included. You’ll need to load a trained actor critic model, along with access to its policy and value functions for plotting. Here, the trained model has been loaded into
agent
with aget_action
method that returns theaction
to take, along with a numpy array ofpolicy
probabilities and a scalarvalue
for the observation at the current time step. - The minigridworld environment conforms to the OpenAI gym API, and the
for
loop is a standard implementation for interacting with the environment. - The gridworld environment already has a built in method for rendering the environment in iteractive mode
env.render('human')
. - Matplotlib’s
autoscale_view
andrelim
functions are used to make updates to the figures at each step. In particular, this allows us to show what appears to be a sliding window over time of the value function line plot. When running the script, the plots pop up as three separate figures.
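To see just that matplotlib mechanism in isolation, here is a tiny self-contained demo (with synthetic data standing in for the value function) of updating a line in place and letting `relim` and `autoscale_view` rescale the axes each step:

```python
import numpy as np
import matplotlib.pyplot as plt

plt.ion()
fig, ax = plt.subplots()
line, = ax.plot([], [])
ax.set_xlabel("step")
ax.set_ylabel("value")

xs, ys = [], []
for t in range(200):
    xs.append(t)
    ys.append(np.sin(t / 10.0) + 0.1 * np.random.randn())  # stand-in for the value function
    # Show only the most recent 50 points, which reads as a sliding window.
    line.set_data(xs[-50:], ys[-50:])
    ax.relim()           # recompute data limits from the updated line
    ax.autoscale_view()  # rescale the axes to those limits
    plt.pause(0.01)
```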
References
[1] Berkeley Deep RL Bootcamp, Core Lecture 6: Nuts and Bolts of Deep RL Experimentation, John Schulman (video | slides). Great advice on the debugging process and what to plot.
[2] OpenAI Spinning Up: Intro to policy optimization