Odds Ratio Preference Optimization (ORPO)
Tags: #AI #nlp #llm #RLHF

Equation
$$\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right]$$
$$\mathcal{L}_{OR} = -\log \sigma \left( \log \frac{\textbf{odds}_\theta(y_w|x)}{\textbf{odds}_\theta(y_l|x)} \right)$$

Latex Code
\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right]
\mathcal{L}_{OR} = -\log \sigma \left( \log \frac{\textbf{odds}_\theta(y_w|x)}{\textbf{odds}_\theta(y_l|x)} \right)
Introduction
The objective function of ORPO (Equation 6 of the paper) consists of two components: 1) the supervised fine-tuning (SFT) loss $\mathcal{L}_{SFT}$; 2) the relative ratio loss $\mathcal{L}_{OR}$. Here $\textbf{odds}_\theta(y|x) = \frac{P_\theta(y|x)}{1 - P_\theta(y|x)}$, where $P_\theta(y|x)$ is the length-normalized likelihood of response $y$ given prompt $x$ under the policy.
Together, $\mathcal{L}_{SFT}$ and $\mathcal{L}_{OR}$, weighted by $\lambda$, adapt the pre-trained language model to the desired domain while disfavoring generations that resemble the rejected responses. A minimal sketch of the objective is given below.
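To make the two terms concrete, here is a minimal PyTorch sketch of the batched ORPO objective under the assumption that per-sequence log-probabilities are length-normalized, as in the paper. The function and variable names (`log_odds`, `orpo_loss`, `chosen_avg_logp`, `lam`) are illustrative, not from the paper or any library.

```python
import torch
import torch.nn.functional as F

def log_odds(avg_logp: torch.Tensor) -> torch.Tensor:
    # log odds_theta(y|x) = log P - log(1 - P), where P = exp(avg_logp) is the
    # length-normalized sequence probability; clamping keeps log1p stable.
    p = torch.exp(avg_logp).clamp(max=1.0 - 1e-6)
    return torch.log(p) - torch.log1p(-p)

def orpo_loss(chosen_avg_logp: torch.Tensor,
              rejected_avg_logp: torch.Tensor,
              chosen_nll: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """L_ORPO = E[ L_SFT + lambda * L_OR ] over a batch of (x, y_w, y_l) triples.

    chosen_avg_logp / rejected_avg_logp: mean token log-probability of the
    chosen (y_w) and rejected (y_l) responses under the current policy.
    chosen_nll: standard SFT negative log-likelihood on the chosen response.
    """
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    l_or = -F.logsigmoid(log_odds_ratio)  # -log sigma(log odds ratio)
    return (chosen_nll + lam * l_or).mean()

# Toy usage with made-up per-sequence statistics.
chosen_avg_logp = torch.tensor([-0.8, -1.1])
rejected_avg_logp = torch.tensor([-1.5, -1.3])
chosen_nll = -chosen_avg_logp  # NLL equals the negative mean log-probability here
print(orpo_loss(chosen_avg_logp, rejected_avg_logp, chosen_nll))
```

Because the odds ratio term only nudges relative likelihoods, a small $\lambda$ (e.g. 0.1) is typically enough to separate chosen from rejected responses without hurting the SFT fit.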
Paper: ORPO: Monolithic Preference Optimization without Reference Model
Hugging Face: ORPO Trainer
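For end-to-end training, the ORPO Trainer in Hugging Face TRL implements this objective directly. The following is a minimal sketch assuming the `trl` `ORPOConfig`/`ORPOTrainer` interface and a public preference dataset with chosen/rejected pairs; the model, dataset, and hyperparameter choices are illustrative, and exact argument names (e.g. `processing_class` vs. `tokenizer`) may differ across `trl` versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Any preference dataset with prompt / chosen / rejected pairs works here.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = ORPOConfig(
    output_dir="orpo-gpt2",
    beta=0.1,  # plays the role of lambda in the ORPO objective
    per_device_train_batch_size=2,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```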