Contrastive Preference Optimization (CPO)
Tags: #AI #nlp #llm #RLHF

Equation
$$\mathcal{L}(\pi_\theta;\pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \frac{\pi_{\theta}(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_{\theta}(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \Big) \Big] $$

$$ \mathcal{L}(\pi_\theta;U) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \pi_{\theta}(y_w | x) - \beta \log \pi_{\theta}(y_l | x) \Big) \Big] $$

$$ \min_\theta \mathcal{L}(\pi_\theta, U) \quad \text{s.t. } \mathbb{E}_{(x,y_w) \sim \mathcal{D}}\Big[ \mathbb{KL}\big(\pi_w(y_w|x)\,||\,\pi_\theta(y_w|x)\big)\Big] < \epsilon $$

$$ \min_\theta\underbrace{ \mathcal{L}(\pi_\theta, U)}_{\mathcal{L}_\text{prefer}} \underbrace{-\mathbb{E}_{(x,y_w) \sim \mathcal{D}} [\log \pi_\theta(y_w| x)]}_{\mathcal{L}_\text{NLL}} $$

Latex Code
\mathcal{L}(\pi_\theta;\pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \frac{\pi_{\theta}(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_{\theta}(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \Big) \Big] $$ $$ \mathcal{L}(\pi_\theta;U) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \pi_{\theta}(y_w | x) - \beta \log \pi_{\theta}(y_l | x) \Big) \Big] $$ $$ \min_\theta \mathcal{L}(\pi_\theta, U) \quad \text{s.t. } \mathbb{E}_{(x,y_w) \sim \mathcal{D}}\Big[ \mathbb{KL}\big(\pi_w(y_w|x)\,||\,\pi_\theta(y_w|x)\big)\Big] < \epsilon $$ $$ \min_\theta\underbrace{ \mathcal{L}(\pi_\theta, U)}_{\mathcal{L}_\text{prefer}} \underbrace{-\mathbb{E}_{(x,y_w) \sim \mathcal{D}} [\log \pi_\theta(y_w| x)]}_{\mathcal{L}_\text{NLL}}
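As a sanity check on the formulas above, here is a minimal PyTorch sketch of the combined objective $$\mathcal{L}_\text{prefer} + \mathcal{L}_\text{NLL}$$. It assumes the sequence-level log-probabilities of the chosen (preferred) and rejected responses have already been computed; the function name, its arguments, and the choice to sum rather than length-average the log-probabilities are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def cpo_loss(chosen_logps: torch.Tensor,
             rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Illustrative CPO loss.

    chosen_logps / rejected_logps: shape (batch,), each entry is
    log pi_theta(y|x) summed over the response tokens.
    """
    # L_prefer: DPO-style term with a uniform reference policy,
    # -log sigma( beta * (log pi(y_w|x) - log pi(y_l|x)) )
    prefer_loss = -F.logsigmoid(beta * (chosen_logps - rejected_logps))

    # L_NLL: negative log-likelihood term on the preferred responses
    nll_loss = -chosen_logps

    return (prefer_loss + nll_loss).mean()
```

The NLL term plays the role of the KL constraint in the third equation: it keeps $$\pi_\theta$$ close to the distribution of preferred outputs while the preference term widens the gap between $$y_w$$ and $$y_l$$.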
Introduction
Contrastive Preference Optimization (CPO) is a novel approach that trains models to avoid generating translations that are adequate but not perfect. The CPO loss combines a preference learning term $$\mathcal{L}_{\text{prefer}}$$ with a negative log-likelihood term $$\mathcal{L}_{\text{NLL}}$$.
paper: Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
huggingface: CPO Trainer
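For reference, a minimal training sketch using the HuggingFace TRL CPO Trainer linked above. The model and dataset names are placeholders drawn from the TRL documentation, and argument names (for example `processing_class`, which older TRL releases call `tokenizer`) may differ across versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

# Placeholder policy model and a preference dataset with
# prompt / chosen / rejected columns.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# beta scales the preference term, as in the equations above.
training_args = CPOConfig(output_dir="cpo-model", beta=0.1, logging_steps=10)

trainer = CPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions: tokenizer=tokenizer
)
trainer.train()
```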