Group Relative Policy Optimization (GRPO)
Tags: #AI #Machine Learning #NLP
Equation
$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)]} \frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_\theta(o_{i,t} | q, o_{i,\lt t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,\lt t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t} | q, o_{i,\lt t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,\lt t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right]\right\}$$
Latex Code
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)]} \frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_\theta(o_{i,t} | q, o_{i,\lt t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,\lt t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t} | q, o_{i,\lt t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,\lt t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right]\right\}
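To make the objective concrete, here is a minimal PyTorch-style sketch of the per-group objective for one question. It is an illustration, not a reference implementation: the tensor names (log_probs, old_log_probs, ref_log_probs, advantages) and the values of epsilon and beta are assumptions, and all G outputs are assumed to have the same length (real code would apply a padding mask).

```python
import torch

def grpo_objective(log_probs, old_log_probs, ref_log_probs,
                   advantages, epsilon=0.2, beta=0.04):
    """Per-group GRPO objective (to be maximized).

    All tensors have shape (G, T): per-token log-probabilities under the
    current policy pi_theta, the old policy pi_theta_old, and the frozen
    reference policy pi_ref, plus the group-relative advantages A_hat.
    Assumes equal-length outputs; epsilon and beta are illustrative values.
    """
    # Importance ratio pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages,
    )

    # Unbiased per-token KL estimator of D_KL[pi_theta || pi_ref]:
    # r_ref - log(r_ref) - 1, with r_ref = pi_ref / pi_theta.
    log_ratio_ref = ref_log_probs - log_probs
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # Average over tokens within each output, then over the G outputs.
    return (surrogate - beta * kl).mean(dim=-1).mean()
```

A training step would minimize the negative of this value, averaged over the questions in a batch.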
Introduction
$$ \text{GRPO} $$: Group Relative Policy Optimization is a reinforcement learning algorithm that obviates the need for the additional value function approximation used in PPO; instead, it uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question $$ q $$, GRPO samples a group of outputs $$ \{o_{1}, o_{2}, ..., o_{G}\} $$ from the old policy $$ \pi_{\theta_{old}} $$ and then optimizes the policy model by maximizing the objective above.
$$ \hat{A}_{i,t} $$: the advantage, calculated based on the relative rewards of the outputs within each group only, as shown below.
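For outcome supervision (a single scalar reward $$ r_{i} $$ per output), the DeepSeekMath paper sets the advantage of every token in output $$ o_{i} $$ to the group-normalized reward:
$$ \hat{A}_{i,t} = \tilde{r}_{i} = \frac{r_{i} - \text{mean}(\{r_{1}, r_{2}, \ldots, r_{G}\})}{\text{std}(\{r_{1}, r_{2}, \ldots, r_{G}\})} $$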
$$ \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right] $$: the KL divergence between the current policy and the reference policy, penalized with coefficient $$ \beta $$ and estimated per token with the following unbiased estimator.
$$ \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t}|q,o_{i,\lt t})}{\pi_{\theta}(o_{i,t}|q,o_{i,\lt t})} - \log\frac{\pi_{ref}(o_{i,t}|q,o_{i,\lt t})}{\pi_{\theta}(o_{i,t}|q,o_{i,\lt t})} - 1 $$
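A small sketch of this estimator (sometimes called the k3 estimator), assuming per-token log-probabilities are already available; the tensor names are hypothetical. Each per-token term has the form $$ x - \log x - 1 \ge 0 $$, so the estimate is always non-negative and vanishes when the two policies agree on the sampled tokens.

```python
import torch

def kl_estimate(log_probs, ref_log_probs):
    """Per-token unbiased estimator of D_KL[pi_theta || pi_ref].

    log_probs and ref_log_probs hold log pi_theta(o_t | q, o_<t) and
    log pi_ref(o_t | q, o_<t) for tokens sampled from pi_theta.
    """
    log_ratio = ref_log_probs - log_probs          # log(pi_ref / pi_theta)
    return torch.exp(log_ratio) - log_ratio - 1.0  # r - log(r) - 1

# Illustrative check with hypothetical log-probabilities.
log_p = torch.log(torch.tensor([0.5, 0.3, 0.2]))
print(kl_estimate(log_p, log_p))        # tensor([0., 0., 0.])
print(kl_estimate(log_p, log_p - 0.1))  # small, strictly positive entries
```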
Related
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models