Recently I started experimenting with a new RL method: Actor-Critic. The basic idea is quite similar to that of a GAN. Unlike Policy Gradient, where a single neural network is trained, Actor-Critic uses two neural networks: an "actor" that performs actions and a "critic" that evaluates the actor's actions. I refer to this page for details of the method.
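To make that concrete, here is a minimal sketch of what the two networks could look like in TensorFlow/Keras (the layer sizes and names are my own choices for illustration, not taken from the referenced page):

```python
import tensorflow as tf
from tensorflow.keras import layers

state_dim = 4    # CartPole observation size (assumption for this sketch)
n_actions = 2    # CartPole has two discrete actions (assumption)

# Actor: maps a state to a probability distribution over actions
actor = tf.keras.Sequential([
    layers.Input(shape=(state_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_actions, activation="softmax"),
])

# Critic: maps a state to a scalar value estimate V(s)
critic = tf.keras.Sequential([
    layers.Input(shape=(state_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),
])
```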
Model presentation
Given a state $s_t$ at time $t$, the network calculates the action probabilities and chooses an action $a_t$ to apply to the environment. After action $a_t$ is applied, the environment updates accordingly and yields a reward $r_t$; at the next time step $t+1$ we observe a new state $s_{t+1}$, which results in a new sampled action $a_{t+1}$:
Steps of the basic Actor-Critic method
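As a concrete illustration, a single interaction step could be written as follows (a sketch assuming the Gym CartPole-v1 environment, the classic 4-tuple `step` API of older Gym versions, and the `actor` network sketched above):

```python
import gym
import tensorflow as tf

env = gym.make("CartPole-v1")
state = env.reset()                          # initial state s_0

# The actor turns s_t into action probabilities, from which a_t is sampled
state_t = tf.convert_to_tensor([state], dtype=tf.float32)
probs = actor(state_t)                       # shape (1, n_actions)
action = int(tf.random.categorical(tf.math.log(probs), 1)[0, 0])

# Applying a_t updates the environment and yields r_t and s_{t+1}
next_state, reward, done, info = env.step(action)
```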
Unlike Policy Gradient, which needs to gather all the data of an episode before calculating gradients, the Actor-Critic method updates the model at every agent-environment interaction step, as sketched below:
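In code, one such per-step update might look like this (a sketch under my own assumptions: a one-step TD target, discount factor `gamma`, separate Adam optimizers, and the `actor`/`critic` networks from above):

```python
import tensorflow as tf

actor_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
critic_opt = tf.keras.optimizers.Adam(learning_rate=5e-3)
gamma = 0.99   # discount factor (my choice for this sketch)

def train_step(state, action, reward, next_state, done):
    state = tf.convert_to_tensor([state], dtype=tf.float32)
    next_state = tf.convert_to_tensor([next_state], dtype=tf.float32)
    with tf.GradientTape(persistent=True) as tape:
        probs = actor(state)                    # pi(a|s_t), shape (1, n_actions)
        value = critic(state)[0, 0]             # V(s_t)
        next_value = critic(next_state)[0, 0]   # V(s_{t+1})

        # One-step TD target and TD error (used as the advantage estimate)
        target = reward + gamma * tf.stop_gradient(next_value) * (1.0 - float(done))
        delta = target - value

        # Critic regresses V(s_t) toward the TD target
        critic_loss = tf.square(delta)
        # Actor follows the policy gradient weighted by the TD error
        log_prob = tf.math.log(probs[0, action] + 1e-8)
        actor_loss = -log_prob * tf.stop_gradient(delta)

    actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
    critic_grads = tape.gradient(critic_loss, critic.trainable_variables)
    actor_opt.apply_gradients(zip(actor_grads, actor.trainable_variables))
    critic_opt.apply_gradients(zip(critic_grads, critic.trainable_variables))
    del tape
```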
Implementation with the CartPole game
I followed this YouTube video for my first Actor-Critic exercise. The original code was written in pure Keras, so I modified parts of it to use TensorFlow directly. I also disabled eager mode because of a runtime error (I am using TensorFlow v2 on my laptop).
```python
import tensorflow as tf
```
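The full listing is not reproduced here, but tying the sketches above together, the overall training loop would look roughly like this (my own reconstruction in eager style, not the code from the video; it assumes the `env`, `actor`, and `train_step` defined earlier):

```python
n_episodes = 500   # arbitrary choice for this sketch

for episode in range(n_episodes):
    state = env.reset()
    done = False
    score = 0.0
    while not done:
        # Sample a_t from the actor's action probabilities
        state_t = tf.convert_to_tensor([state], dtype=tf.float32)
        probs = actor(state_t)
        action = int(tf.random.categorical(tf.math.log(probs), 1)[0, 0])

        # Apply a_t, observe r_t and s_{t+1}, then update both networks
        next_state, reward, done, info = env.step(action)
        train_step(state, action, reward, next_state, done)

        state = next_state
        score += reward
    print(f"episode {episode}, score {score:.1f}")
```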
A newer version
I have updated the program shown above to allow eager mode. The main difference is the introduction of a custom loss function for the actor. It works well; however, it runs much slower after my modification (I plan to refactor it later).
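The idea behind such a custom actor loss can be sketched as follows (my own formulation, not necessarily the exact code): the TD error is packed into `y_true` as a scaled one-hot vector so that the policy-gradient loss fits the Keras `y_true`/`y_pred` interface and runs in eager mode.

```python
import tensorflow as tf

def actor_loss(y_true, y_pred):
    """Custom policy-gradient loss for the actor.

    y_pred: action probabilities from the actor's softmax output.
    y_true: one-hot vector of the taken action, scaled by the TD error delta.
    """
    clipped = tf.clip_by_value(y_pred, 1e-8, 1.0)
    return -tf.reduce_sum(y_true * tf.math.log(clipped), axis=-1)

# Hypothetical wiring with the actor network and a TD error `delta`:
# actor.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=actor_loss)
# y_true = delta * tf.one_hot([action], n_actions)
# actor.fit(state_t, y_true, verbose=0)
```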
The training converges nicely. Below is a snapshot of the execution output:
References:
* https://github.com/wangshusen/DRL
* https://www.youtube.com/watch?v=2vJtbAha3To