The intuition behind PPO
The idea with Proximal Policy Optimization (PPO) is to improve the training stability of the policy by limiting the change we make to the policy at each training epoch: we want to avoid policy updates that are too large, for two reasons:
- We know empirically that smaller policy updates during training are more likely to converge to an optimal solution.
- A step that is too big can make the policy fall “off the cliff” (become a bad policy), and recovering can take a long time or even be impossible.
So with PPO, we update the policy conservatively. To do so, we measure how much the current policy changed compared to the former one by computing the ratio between the current and former policies. We then clip this ratio to a range (between 1−ε and 1+ε, where ε is a small hyperparameter), which removes the incentive for the current policy to move too far from the old one (hence the term proximal policy).
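To make this concrete, here is the standard form of the ratio and the clipped surrogate objective (introduced in detail in the next section), where πθ is the current policy, πθ_old the former one, Â_t the advantage estimate at timestep t, and ε the clipping hyperparameter (commonly 0.2):

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left( r_t(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right) \right]
```

And here is a minimal PyTorch sketch of that loss, assuming we already have the log-probabilities of the taken actions under the current and former policies plus their advantage estimates; the function and argument names are illustrative, not part of the course code:

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Ratio between current and former policy, computed in log space
    # for numerical stability: r = pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped surrogate vs. surrogate with the ratio clipped to [1-eps, 1+eps]
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Take the pessimistic (element-wise minimum) objective; negate it
    # because optimizers minimize, while we want to maximize the objective
    return -torch.min(unclipped, clipped).mean()
```

Taking the minimum of the unclipped and clipped terms is what removes the gradient incentive to push the policy outside the [1−ε, 1+ε] neighborhood of the old policy: exactly the conservative update described above.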