Direct Preference Optimization (DPO)

Tags: #nlp #llm #RLHF

Equation

$$\pi_{r} (y|x) = \frac{1}{Z(x)} \pi_{ref} (y|x) \exp(\frac{1}{\beta} r(x,y) ) , r(x,y) = \beta \log \frac{\pi_{r} (y|x)}{\pi_{ref} (y|x)} + \beta \log Z(x) , p^{*}(y_{1} > y_{2} |x) = \frac{1}{1+\exp{(\beta \log \frac{\pi^{*} (y_{2}|x)}{\pi_{ref} (y_{2}|x)} - \beta \log \frac{\pi^{*} (y_{1}|x)}{\pi_{ref} (y_{1}|x)} )}} , \mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref}) = -\mathbb{E}_{(x, y_{w},y_{l}) \sim D } [\log \sigma (\beta \log \frac{\pi_{\theta} (y_{w}|x)}{\pi_{ref} (y_{w}|x)} - \beta \log \frac{\pi_{\theta} (y_{l}|x)}{\pi_{ref} (y_{l}|x)} )] , \nabla_{\theta} \mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref}) = - \beta \mathbb{E}_{(x, y_{w},y_{l}) \sim D } [ \sigma ( \hat{r}_{\theta} (x, y_{l}) - \hat{r}_{\theta} (x, y_{w})) [\nabla_{\theta} \log \pi_{\theta} (y_{w}|x) - \nabla_{\theta} \log \pi_{\theta} (y_{l}|x) ] ] , \hat{r}_{\theta} (x, y) = \beta \log (\frac{\pi_{\theta} (y|x)}{\pi_{ref} (y|x)})$$

LaTeX Code

\pi_{r} (y|x) = \frac{1}{Z(x)} \pi_{ref} (y|x) \exp(\frac{1}{\beta} r(x,y) ) ,

r(x,y) = \beta \log \frac{\pi_{r} (y|x)}{\pi_{ref} (y|x)} + \beta \log Z(x) ,

p^{*}(y_{1} > y_{2} |x) = \frac{1}{1+\exp{(\beta \log \frac{\pi^{*} (y_{2}|x)}{\pi_{ref} (y_{2}|x)} - \beta \log \frac{\pi^{*} (y_{1}|x)}{\pi_{ref} (y_{1}|x)} )}} ,

\mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref}) = -\mathbb{E}_{(x, y_{w},y_{l}) \sim D } [\log \sigma (\beta \log \frac{\pi_{\theta} (y_{w}|x)}{\pi_{ref} (y_{w}|x)} - \beta \log \frac{\pi_{\theta} (y_{l}|x)}{\pi_{ref} (y_{l}|x)} )] ,

\nabla_{\theta} \mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref}) = - \beta \mathbb{E}_{(x, y_{w},y_{l}) \sim D } [ \sigma ( \hat{r}_{\theta} (x, y_{l}) - \hat{r}_{\theta} (x, y_{w})) [\nabla_{\theta} \log \pi_{\theta} (y_{w}|x) - \nabla_{\theta} \log \pi_{\theta} (y_{l}|x) ] ] ,

\hat{r}_{\theta} (x, y) = \beta \log (\frac{\pi_{\theta} (y|x)}{\pi_{ref} (y|x)})
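To connect the loss above to code, here is a minimal sketch (not the paper's reference implementation) of the DPO loss in PyTorch. It assumes the per-sequence log-probabilities $$ \log \pi(y|x) $$ have already been summed over tokens; the function name and argument names are illustrative.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Inputs: 1-D tensors of shape (batch,) holding summed log pi(y|x)
    # for the preferred (y_w) and dispreferred (y_l) completions,
    # under the trainable policy pi_theta and the frozen reference pi_ref.
    # Implicit reward: r_hat_theta(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) )
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO loss: -log sigma( r_hat(x, y_w) - r_hat(x, y_l) ), averaged over the batch
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return losses.mean()

Because $$ \pi_{ref} $$ is kept frozen, its log-probabilities can be computed once and cached, so each training step only needs a forward pass through $$ \pi_{\theta} $$.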
                            


Introduction

$$ \mathcal{L}_{DPO} $$: denotes the loss function of Direct Preference Optimization (DPO).
$$ \nabla_{\theta} \mathcal{L}_{DPO} $$: denotes the gradient of the DPO loss with respect to the model parameters $$\theta$$.
$$ r(x,y) $$: denotes the true reward function.
$$ \pi_{\theta}(.) $$: denotes the language model (policy) being updated.
$$ \pi_{ref}(.) $$: denotes the reference language model, which is kept frozen during training.
$$ \hat{r}_{\theta}(x,y) $$: denotes the implicit reward defined by the updated policy $$ \pi_{\theta}(.) $$ and the reference model $$ \pi_{ref}(.) $$.
$$ \pi_{r} (y|x) $$: denotes the optimal solution to the KL-constrained reward maximization objective.
$$ Z(x) $$: denotes the partition function.
$$ p^{*}(y_{1} > y_{2} |x) $$: denotes the probability that $$y_{1}$$ is preferred over $$y_{2}$$ given input $$x$$, under the Bradley-Terry preference model.
$$ \pi^{*} $$: denotes the optimal Reinforcement Learning from Human Feedback (RLHF) policy.
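As a brief check of how the preference probability above follows from these definitions: the Bradley-Terry model gives $$ p^{*}(y_{1} > y_{2}|x) = \sigma(r(x,y_{1}) - r(x,y_{2})) $$, and substituting the reparameterized reward $$ r(x,y) = \beta \log \frac{\pi_{r}(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x) $$ cancels the $$ \beta \log Z(x) $$ terms, leaving only policy ratios:

$$ p^{*}(y_{1} > y_{2}|x) = \sigma \big( \beta \log \frac{\pi^{*}(y_{1}|x)}{\pi_{ref}(y_{1}|x)} - \beta \log \frac{\pi^{*}(y_{2}|x)}{\pi_{ref}(y_{2}|x)} \big) $$

This is why DPO can fit the policy directly on preference data, without training a separate explicit reward model.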

Reference
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
