Proximal Policy Optimization

Tags: #machine learning #AI #LLM

Equation

$$\arg\max\limits_{\pi}{ E_{p \sim D,g \sim \pi} [R(g|p)] }, R(g|p) = \tilde{R}_{c}(g|p) - \beta D_{KL}( \pi_{\theta} (g|p) || \pi_{0} (g|p))$$

Latex Code

    \arg\max\limits_{\pi}{ E_{p \sim D,g \sim \pi} [R(g|p)] }, R(g|p) = \tilde{R}_{c}(g|p) - \beta D_{KL}( \pi_{\theta} (g|p) || \pi_{0} (g|p))

Introduction

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm used in reinforcement learning from human feedback (RLHF) during the fine-tuning stage of a Large Language Model (LLM). It optimizes the policy against a learned reward model, helping align the LLM's generations with human judgement and feedback. Training iteratively improves the policy by sampling prompts p from the dataset D and generations g from the policy $$\pi$$, scoring the generations with the reward model, and applying the PPO update to the resulting objective. The symbols in the equation are:

- $$R(g|p)$$: the final reward function.
- $$\tilde{R}_{c}(g|p)$$: the reward model score we define, for example the piecewise combination of the safety reward (Rs) and the helpfulness reward (Rh) in the LLaMA 2 model.
- $$\pi_{0}(g|p)$$: the original policy that generates response g given prompt p.
- $$\pi_{\theta}(g|p)$$: the policy being optimized, with parameters $$\theta$$.
- $$\beta$$: the KL penalty coefficient, which prevents the optimized policy from diverging too far from the original policy.
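
To make the reward term concrete, below is a minimal PyTorch sketch (not taken from the LLaMA 2 code or any particular RLHF library) of how the KL-penalized reward $$R(g|p)$$ could be computed for a batch of sampled generations. The function name `kl_penalized_reward`, the tensor shapes, and the use of per-token log-probabilities summed over the sequence as a Monte Carlo estimate of the KL term are illustrative assumptions.

```python
# Minimal sketch, assuming PyTorch; not a specific RLHF library's implementation.
# It computes the final reward
#   R(g|p) = R_c(g|p) - beta * D_KL(pi_theta(g|p) || pi_0(g|p))
# for a batch of sampled generations, estimating the KL term from
# per-token log-probabilities of the two policies.
import torch


def kl_penalized_reward(
    reward_scores: torch.Tensor,    # R_c(g|p), shape [batch], from the reward model
    logprobs_policy: torch.Tensor,  # log pi_theta per generated token, [batch, seq_len]
    logprobs_ref: torch.Tensor,     # log pi_0 per generated token, [batch, seq_len]
    mask: torch.Tensor,             # 1 for generated tokens, 0 for padding, [batch, seq_len]
    beta: float = 0.1,              # KL penalty coefficient
) -> torch.Tensor:
    """Return the per-sequence reward R(g|p) used by the PPO update."""
    # Monte Carlo estimate of D_KL(pi_theta || pi_0) on tokens sampled from pi_theta:
    # the mean/sum of (log pi_theta - log pi_0) over the generated tokens.
    per_token_kl = (logprobs_policy - logprobs_ref) * mask
    kl_estimate = per_token_kl.sum(dim=-1)
    # Subtract the scaled KL penalty from the reward model score.
    return reward_scores - beta * kl_estimate


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, seq_len = 2, 4
    scores = torch.tensor([1.2, 0.3])        # hypothetical reward model outputs
    lp_policy = -torch.rand(batch, seq_len)  # fake per-token log-probs of pi_theta
    lp_ref = -torch.rand(batch, seq_len)     # fake per-token log-probs of pi_0
    mask = torch.ones(batch, seq_len)
    print(kl_penalized_reward(scores, lp_policy, lp_ref, mask, beta=0.1))
```

In a full RLHF loop, this KL-penalized reward would then be combined with value estimates and plugged into PPO's clipped surrogate objective to update the policy parameters $$\theta$$.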
