Proximal Policy Optimization
Tags: #machine learning #AI #LLM
Equation
$$\arg\max\limits_{\pi} \; E_{p \sim D,\, g \sim \pi} \left[ R(g|p) \right], \quad R(g|p) = \tilde{R}_{c}(g|p) - \beta\, D_{KL}\!\left( \pi_{\theta}(g|p) \,\|\, \pi_{0}(g|p) \right)$$
Latex Code
\arg\max\limits_{\pi} \; E_{p \sim D,\, g \sim \pi} \left[ R(g|p) \right], \quad R(g|p) = \tilde{R}_{c}(g|p) - \beta\, D_{KL}\!\left( \pi_{\theta}(g|p) \,\|\, \pi_{0}(g|p) \right)
Introduction
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm used in the reinforcement learning from human feedback (RLHF) stage of finetuning a Large Language Model (LLM). It optimizes the policy (the LLM) against a learned reward model so that the model's generations align better with human judgement and feedback. The policy is improved iteratively by sampling prompts p from the dataset D and generations g from the current policy \pi, then applying the PPO update to the objective above. The terms in the equation are:
$$ R(g|p) $$: the final reward function,
$$ \tilde{R}_{c}(g|p) $$: the reward defined from the reward model, for example the piecewise combination of the safety (Rs) and helpfulness (Rh) rewards in the LLaMA 2 model,
$$ \pi_{0} (g|p) $$: the original (reference) policy for generating response g given prompt p,
$$ \pi_{\theta} (g|p) $$: the policy being optimized, with parameters \theta,
$$ \beta $$: the KL penalty coefficient, which keeps the optimized policy from diverging too far from the original policy.
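To make the KL-penalized reward concrete, here is a minimal sketch of how R(g|p) can be computed from the reward-model score and the token log-probabilities of the two policies. It assumes PyTorch, and all function and tensor names are illustrative; it is not the exact LLaMA 2 or any library's implementation, and it uses a simple single-sample estimate of the KL term (the summed log-ratio over the sampled tokens).

```python
# Sketch: R(g|p) = R_c(g|p) - beta * KL(pi_theta || pi_0)
# Names, shapes, and the KL estimator are assumptions for illustration.

import torch

def kl_penalized_reward(
    reward_model_score: torch.Tensor,  # shape (batch,): R_c(g|p) from the reward model
    logprobs_new: torch.Tensor,        # shape (batch, seq): log pi_theta(token | context)
    logprobs_ref: torch.Tensor,        # shape (batch, seq): log pi_0(token | context)
    beta: float = 0.1,                 # KL penalty coefficient
) -> torch.Tensor:
    # Single-sample KL estimate: sum of log-ratios over the generated tokens.
    kl = (logprobs_new - logprobs_ref).sum(dim=-1)
    # Final reward used in the PPO objective: reward-model score minus the scaled KL penalty.
    return reward_model_score - beta * kl

if __name__ == "__main__":
    torch.manual_seed(0)
    batch, seq = 2, 5
    r_c = torch.tensor([1.2, 0.4])       # stand-in reward-model scores
    lp_new = -torch.rand(batch, seq)     # stand-in log-probs of sampled tokens under pi_theta
    lp_ref = -torch.rand(batch, seq)     # stand-in log-probs under the frozen reference policy pi_0
    print(kl_penalized_reward(r_c, lp_new, lp_ref, beta=0.1))
```

A larger beta ties the generations more tightly to the original policy; a smaller beta lets the policy chase the reward model more aggressively at the risk of reward hacking.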