(The contents of this NIPS spotlight video are similar to the post below, although the post is a bit more detailed.)
Reinforcement learning agents can learn to play video games (for instance Atari games) by themselves. The original DQN algorithm and many of its successors clip the rewards they receive while learning. This helps stabilize the deep learning, but it can lead to qualitatively different solutions and in a recent paper we propose an alternative. Papers do not (yet) include videos, and so we show some concrete examples here.
In our first example, we compare the learned behavior on the Atari game Ms. Pac-Man. First, we look at the behavior of a Double DQN agent with clipped rewards. All positive rewards are clipped to +1. This means that normal pellets (+10), power pills (+50), and ghosts (+200 or more) are worth the same +1 reward as far as this agent is concerned.
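The effect of clipping can be sketched in a couple of lines. This is a minimal illustration, not the DQN training code itself; the reward values are the Ms. Pac-Man scores mentioned above.

```python
import numpy as np

def clip_reward(r):
    """DQN-style reward clipping: every reward is mapped to -1, 0, or +1."""
    return float(np.sign(r))

# Ms. Pac-Man events and their true in-game rewards:
events = [("pellet", 10), ("power pill", 50), ("ghost", 200)]

for name, r in events:
    print(f"{name}: true reward {r}, clipped reward {clip_reward(r)}")
    # pellet: true reward 10, clipped reward 1.0
    # power pill: true reward 50, clipped reward 1.0
    # ghost: true reward 200, clipped reward 1.0
```

From the clipped agent's point of view, all three events are exactly equally valuable, which explains why it has no reason to chase ghosts.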
The agent does a reasonable job of eating pellets, but seems entirely uninterested in eating the ghosts, which are worth far more points.
We can compare this to the same agent but now trained without clipping the rewards and using Pop-Art instead to adaptively normalize the targets for the deep net (as explained in our paper).
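The core idea of Pop-Art is to keep running estimates of the mean and scale of the targets, train the network on normalized targets, and at every statistics update rescale the last linear layer so that the unnormalized outputs are unchanged. The sketch below shows this for a scalar output; the class name, the `beta` step size, and the scalar "last layer" are illustrative simplifications of the scheme described in the paper.

```python
import numpy as np

class PopArt:
    """Sketch of Pop-Art: adaptively rescale targets while
    preserving the network's unnormalized outputs."""

    def __init__(self, beta=1e-3):
        self.mu, self.nu = 0.0, 1.0   # running first and second moments of the targets
        self.beta = beta              # moment step size (illustrative value)
        self.w, self.b = 1.0, 0.0     # scalar stand-in for the last linear layer

    @property
    def sigma(self):
        return np.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def unnormalized(self, h):
        """Prediction in the original target scale, for hidden feature h."""
        return self.sigma * (self.w * h + self.b) + self.mu

    def update_stats(self, target):
        old_mu, old_sigma = self.mu, self.sigma
        # exponential moving averages of the target moments
        self.mu += self.beta * (target - self.mu)
        self.nu += self.beta * (target ** 2 - self.nu)
        # rescale the last layer so unnormalized outputs are preserved exactly
        self.w *= old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma

    def normalize(self, target):
        """Normalized target to regress the network towards."""
        return (target - self.mu) / self.sigma
```

The key invariant is that `unnormalized(h)` returns the same value before and after `update_stats`, so a sudden large reward (like eating a ghost) changes the normalization statistics without perturbing everything the network has already learned.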
The difference is clear: because the agent now sees the differences between the actual rewards, it has learned to become a proper ghostbuster.
As another example, let's look at the game Centipede. Again, we start with the agent that only sees clipped rewards.
The agent controls the square near the bottom of the screen that is shooting at the snake pieces moving down the screen. There is also a spider, but the agent never shoots it; instead it regularly jumps onto the spider, killing itself in the process.
Now we look at the agent that uses target normalization instead of reward clipping.
There is a big change in behavior. Instead of jumping on the spider, the agent now immediately shoots it whenever it enters the screen to get a nice big juicy reward.
These qualitatively different policies don’t always lead to higher total scores. As an example, consider the game of Time Pilot. First, we look at the agent with clipped rewards.
The agent controls the plane in the center that is happily shooting at other planes. There are also blimps, but these are essentially ignored by the agent. Now consider the agent that uses target normalization.
This agent is much less interested in the planes but when the blimp enters the screen the agent hunts it with determination.
The explanation for this behavior is as follows: when the agent shoots the blimp, it transitions to the next level and receives a healthy chunk of reward. The clipped agent does not care about the size of this reward, and probably does not like the transition to a new, less familiar state. The unclipped agent does appreciate the larger reward, but it then ends up in later levels of the game, which it has not yet mastered as well. In total, the overall performance of the second agent turns out to be somewhat lower than that of the first. In other words, the second agent learns a qualitatively different behavior, which in this case turns out to be worse in the end.
It should be possible to switch to a more successful policy later, but that is really a question of how to explore efficiently. The general problem of how best to explore in reinforcement learning is not yet solved.