Group Relative Policy Optimization (GRPO)
Tags: #AI #Machine Learning #NLP
Equation
$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)]} \frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_\theta(o_{i,t} | q, o_{i,\lt t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,\lt t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t} | q, o_{i,\lt t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,\lt t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right]\right\}$$
Latex Code
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)]} \frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_\theta(o_{i,t} | q, o_{i,\lt t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,\lt t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t} | q, o_{i,\lt t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,\lt t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right]\right\}
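To make the objective concrete, here is a minimal PyTorch-style sketch of the per-group objective for one question. It is an illustration, not a reference implementation: the tensor names (log_probs, old_log_probs, ref_log_probs, advantages) and the values of epsilon and beta are assumptions, and all G outputs are assumed to have the same length (real code would apply a padding mask).

```python
import torch

def grpo_objective(log_probs, old_log_probs, ref_log_probs,
                   advantages, epsilon=0.2, beta=0.04):
    """Per-group GRPO objective (to be maximized).

    All tensors have shape (G, T): per-token log-probabilities under the
    current policy pi_theta, the old policy pi_theta_old, and the frozen
    reference policy pi_ref, plus the group-relative advantages A_hat.
    Assumes equal-length outputs; epsilon and beta are illustrative values.
    """
    # Importance ratio pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages,
    )

    # Unbiased per-token KL estimator of D_KL[pi_theta || pi_ref]:
    # r_ref - log(r_ref) - 1, with r_ref = pi_ref / pi_theta.
    log_ratio_ref = ref_log_probs - log_probs
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # Average over tokens within each output, then over the G outputs.
    return (surrogate - beta * kl).mean(dim=-1).mean()
```

A training step would minimize the negative of this value, averaged over the questions in a batch.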
Introduction
$$ \text{GRPO} $$: Group Relative Policy Optimization is a reinforcement learning algorithm that obviates the need for the additional value function approximation used in PPO; instead, it uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question $$ q $$, GRPO samples a group of outputs $$ \{o_{1}, o_{2}, ..., o_{G}\} $$ from the old policy $$ \pi_{\theta_{old}} $$ and then optimizes the policy model by maximizing the objective above.
$$ \hat{A}_{i,t} $$: the advantage, calculated based on the relative rewards of the outputs within each group only, as shown below.
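For outcome supervision (a single scalar reward $$ r_{i} $$ per output), the DeepSeekMath paper sets the advantage of every token in output $$ o_{i} $$ to the group-normalized reward:
$$ \hat{A}_{i,t} = \tilde{r}_{i} = \frac{r_{i} - \text{mean}(\{r_{1}, r_{2}, \ldots, r_{G}\})}{\text{std}(\{r_{1}, r_{2}, \ldots, r_{G}\})} $$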
$$ \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right] $$: the KL divergence between the current policy and the reference policy, penalized with coefficient $$ \beta $$ and estimated per token with the following unbiased estimator.
$$ \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t}|q,o_{i,\lt t})}{\pi_{\theta}(o_{i,t}|q,o_{i,\lt t})} - \log\frac{\pi_{ref}(o_{i,t}|q,o_{i,\lt t})}{\pi_{\theta}(o_{i,t}|q,o_{i,\lt t})} - 1 $$
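A small sketch of this estimator (sometimes called the k3 estimator), assuming per-token log-probabilities are already available; the tensor names are hypothetical. Each per-token term has the form $$ x - \log x - 1 \ge 0 $$, so the estimate is always non-negative and vanishes when the two policies agree on the sampled tokens.

```python
import torch

def kl_estimate(log_probs, ref_log_probs):
    """Per-token unbiased estimator of D_KL[pi_theta || pi_ref].

    log_probs and ref_log_probs hold log pi_theta(o_t | q, o_<t) and
    log pi_ref(o_t | q, o_<t) for tokens sampled from pi_theta.
    """
    log_ratio = ref_log_probs - log_probs          # log(pi_ref / pi_theta)
    return torch.exp(log_ratio) - log_ratio - 1.0  # r - log(r) - 1

# Illustrative check with hypothetical log-probabilities.
log_p = torch.log(torch.tensor([0.5, 0.3, 0.2]))
print(kl_estimate(log_p, log_p))        # tensor([0., 0., 0.])
print(kl_estimate(log_p, log_p - 0.1))  # small, strictly positive entries
```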
Related
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models