Proximal Policy Optimization (PPO)

Tags: #machine learning

Equation

$$L^{CLIP}(\theta)=E_{t}[\min(r_{t}(\theta)A_{t}, \text{clip}(r_{t}(\theta), 1-\epsilon,1+\epsilon)A_{t})]$$

Latex Code

    L^{CLIP}(\theta)=E_{t}[\min(r_{t}(\theta)A_{t}, \text{clip}(r_{t}(\theta), 1-\epsilon,1+\epsilon)A_{t})]


Introduction

With supervised learning, we can easily implement the cost function, run gradient descent on it, and be confident that we will get good results with relatively little hyperparameter tuning. The route to success in reinforcement learning is not as obvious: the algorithms have many moving parts that are hard to debug, and they require substantial tuning to get good results. PPO strikes a balance between ease of implementation, sample complexity, and ease of tuning. At each step it tries to compute an update that minimizes the cost function while keeping the deviation from the previous policy relatively small. Source: https://openai.com/research/openai-baselines-ppo
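To make the "relatively small deviation" point concrete, the sketch below (plain Python with made-up numbers, not code from the linked post) evaluates the clipped per-timestep term and shows that, once the probability ratio leaves the interval $[1-\epsilon, 1+\epsilon]$, the objective no longer rewards moving further from the old policy.

    def clipped_term(ratio, advantage, epsilon=0.2):
        """Per-timestep PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        return min(ratio * advantage, clipped_ratio * advantage)

    # With a positive advantage, pushing the ratio beyond 1 + epsilon earns no
    # extra credit, so the update has no incentive to stray far from the old policy.
    print(clipped_term(1.1, advantage=1.0))  # 1.1 -> inside the clip range, unchanged
    print(clipped_term(1.5, advantage=1.0))  # 1.2 -> capped at 1 + epsilon
    print(clipped_term(0.5, advantage=1.0))  # 0.5 -> the min keeps the lower, unclipped value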


Explanation

  • $\theta$ is the policy parameter
  • $E_{t}$ denotes the empirical expectation over timesteps
  • $r_{t}(\theta)$ is the ratio of the probability under the new and old policies, respectively
  • $A_{t}$ is the estimated advantage at time $t$
  • $\epsilon$ is a hyperparameter, usually 0.1 or 0.2
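
Putting the pieces above together, here is a minimal NumPy sketch of the objective. The function name, argument names, and example numbers are illustrative assumptions rather than part of any reference implementation; in practice the same quantity is computed with an autodiff framework and negated so it can be minimized as a loss.

    import numpy as np

    def clipped_surrogate_objective(new_logp, old_logp, advantages, epsilon=0.2):
        """L^CLIP(theta) averaged over a batch of timesteps.

        new_logp, old_logp: per-timestep log-probabilities of the taken actions
                            under the new and old policies
        advantages:         per-timestep advantage estimates A_t
        epsilon:            clip range, usually 0.1 or 0.2
        """
        # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space
        ratio = np.exp(new_logp - old_logp)

        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

        # Empirical expectation E_t[min(unclipped, clipped)]
        return np.minimum(unclipped, clipped).mean()

    # Illustrative call with made-up numbers
    new_logp = np.array([-1.0, -0.5, -2.0])
    old_logp = np.array([-1.1, -0.7, -1.5])
    advantages = np.array([0.3, -0.2, 1.0])
    print(clipped_surrogate_objective(new_logp, old_logp, advantages))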
