Proximal Policy Optimization
Tags: #machine learning #AI #LLM
Equation
$$\arg\max\limits_{\pi} \; E_{p \sim D,\, g \sim \pi} \left[ R(g|p) \right], \quad R(g|p) = \tilde{R}_{c}(g|p) - \beta\, D_{KL}\!\left( \pi_{\theta}(g|p) \,\|\, \pi_{0}(g|p) \right)$$
Latex Code
\arg\max\limits_{\pi} \; E_{p \sim D,\, g \sim \pi} \left[ R(g|p) \right], \quad R(g|p) = \tilde{R}_{c}(g|p) - \beta\, D_{KL}\!\left( \pi_{\theta}(g|p) \,\|\, \pi_{0}(g|p) \right)
Introduction
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm used in the reinforcement learning from human feedback (RLHF) stage of finetuning a Large Language Model (LLM). It optimizes the policy (the LLM) against a learned reward model so that the model's generations align better with human judgement and feedback. The policy is improved iteratively by sampling prompts p from the dataset D and generations g from the current policy \pi, then applying the PPO update to the objective above. The terms in the equation are:
$$ R(g|p) $$: the final reward function,
$$ \tilde{R}_{c}(g|p) $$: the reward defined from the reward model, for example the piecewise combination of the safety (Rs) and helpfulness (Rh) rewards in the LLaMA 2 model,
$$ \pi_{0} (g|p) $$: the original (reference) policy for generating response g given prompt p,
$$ \pi_{\theta} (g|p) $$: the policy being optimized, with parameters \theta,
$$ \beta $$: the KL penalty coefficient, which keeps the optimized policy from diverging too far from the original policy.
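To make the KL-penalized reward concrete, here is a minimal sketch of how R(g|p) can be computed from the reward-model score and the token log-probabilities of the two policies. It assumes PyTorch, and all function and tensor names are illustrative; it is not the exact LLaMA 2 or any library's implementation, and it uses a simple single-sample estimate of the KL term (the summed log-ratio over the sampled tokens).

```python
# Sketch: R(g|p) = R_c(g|p) - beta * KL(pi_theta || pi_0)
# Names, shapes, and the KL estimator are assumptions for illustration.

import torch

def kl_penalized_reward(
    reward_model_score: torch.Tensor,  # shape (batch,): R_c(g|p) from the reward model
    logprobs_new: torch.Tensor,        # shape (batch, seq): log pi_theta(token | context)
    logprobs_ref: torch.Tensor,        # shape (batch, seq): log pi_0(token | context)
    beta: float = 0.1,                 # KL penalty coefficient
) -> torch.Tensor:
    # Single-sample KL estimate: sum of log-ratios over the generated tokens.
    kl = (logprobs_new - logprobs_ref).sum(dim=-1)
    # Final reward used in the PPO objective: reward-model score minus the scaled KL penalty.
    return reward_model_score - beta * kl

if __name__ == "__main__":
    torch.manual_seed(0)
    batch, seq = 2, 5
    r_c = torch.tensor([1.2, 0.4])       # stand-in reward-model scores
    lp_new = -torch.rand(batch, seq)     # stand-in log-probs of sampled tokens under pi_theta
    lp_ref = -torch.rand(batch, seq)     # stand-in log-probs under the frozen reference policy pi_0
    print(kl_penalized_reward(r_c, lp_new, lp_ref, beta=0.1))
```

A larger beta ties the generations more tightly to the original policy; a smaller beta lets the policy chase the reward model more aggressively at the risk of reward hacking.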