Contrastive Preference Optimization (CPO)
Tags: #AI #nlp #llm #RLHF
Equation
$$\mathcal{L}(\pi_\theta;\pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \frac{\pi_{\theta}(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_{\theta}(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \Big) \Big]$$
$$\mathcal{L}(\pi_\theta;U) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \pi_{\theta}(y_w | x) - \beta \log \pi_{\theta}(y_l | x) \Big) \Big]$$
$$\min_\theta \mathcal{L}(\pi_\theta, U) \text{ s.t. } \mathbb{E}_{(x,y_w) \sim \mathcal{D}} \Big[ \mathbb{KL}(\pi_w(y_w|x) \,||\, \pi_\theta(y_w|x)) \Big] < \epsilon$$
$$\min_\theta \underbrace{\mathcal{L}(\pi_\theta, U)}_{\mathcal{L}_{\text{prefer}}} \underbrace{- \mathbb{E}_{(x,y_w) \sim \mathcal{D}} [\log \pi_\theta(y_w | x)]}_{\mathcal{L}_{\text{NLL}}}$$
Latex Code
$$
\mathcal{L}(\pi_\theta;\pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \frac{\pi_{\theta}(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_{\theta}(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \Big) \Big]
$$
$$
\mathcal{L}(\pi_\theta;U) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \pi_{\theta}(y_w | x) - \beta \log \pi_{\theta}(y_l | x) \Big) \Big]
$$
$$
\min_\theta \mathcal{L}(\pi_\theta, U) \text{ s.t. } \mathbb{E}_{(x,y_w) \sim \mathcal{D}} \Big[ \mathbb{KL}(\pi_w(y_w|x) \,||\, \pi_\theta(y_w|x)) \Big] < \epsilon
$$
$$
\min_\theta \underbrace{\mathcal{L}(\pi_\theta, U)}_{\mathcal{L}_{\text{prefer}}} \underbrace{- \mathbb{E}_{(x,y_w) \sim \mathcal{D}} [\log \pi_\theta(y_w | x)]}_{\mathcal{L}_{\text{NLL}}}
$$
                            
Introduction
Contrastive Preference Optimization (CPO) is an approach that trains models to avoid generating translations that are adequate but not perfect. The CPO loss combines a preference learning term $$\mathcal{L}_{\text{prefer}}$$ and a negative log-likelihood term $$\mathcal{L}_{\text{NLL}}$$.
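To make the two terms concrete, here is a minimal pure-Python sketch of the combined loss for a small batch. It assumes the sequence-level log-probabilities $$\log \pi_\theta(y_w|x)$$ and $$\log \pi_\theta(y_l|x)$$ have already been computed by the model. The function names `log_sigmoid` and `cpo_loss`, the example numbers, and the value `beta=0.1` are illustrative assumptions, not taken from the paper's code or from the Hugging Face trainer.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def cpo_loss(logp_w, logp_l, beta=0.1):
    """Return (total, prefer, nll), each averaged over the batch.

    logp_w[i] is log pi_theta(y_w | x) for the preferred output,
    logp_l[i] is log pi_theta(y_l | x) for the dis-preferred output.
    beta=0.1 is an illustrative default (an assumption).
    """
    n = len(logp_w)
    # L_prefer: -E[ log sigma( beta * (log pi(y_w|x) - log pi(y_l|x)) ) ]
    prefer = -sum(
        log_sigmoid(beta * (w - l)) for w, l in zip(logp_w, logp_l)
    ) / n
    # L_NLL: -E[ log pi(y_w|x) ]
    nll = -sum(logp_w) / n
    return prefer + nll, prefer, nll


# Hypothetical log-probabilities for two (preferred, dis-preferred) pairs.
total, prefer, nll = cpo_loss([-2.0, -1.5], [-4.0, -1.5], beta=0.1)
print(f"prefer={prefer:.4f} nll={nll:.4f} total={total:.4f}")
```

When the preferred and dis-preferred outputs have equal log-probability, the preference term equals $$\log 2$$; it shrinks as the model assigns more probability to $$y_w$$ than to $$y_l$$, while the NLL term keeps the model close to the preferred data.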
paper: Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation 
huggingface: CPO Trainer