Odds Ratio Preference Optimization (ORPO)
Tags: #AI #nlp #llm #RLHF

Equation
$$\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right]$$
$$\mathcal{L}_{OR} = -\log \sigma \left( \log \frac{\textbf{odds}_\theta(y_w|x)}{\textbf{odds}_\theta(y_l|x)} \right)$$

Latex Code
\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right]
\mathcal{L}_{OR} = -\log \sigma \left( \log \frac{\textbf{odds}_\theta(y_w|x)}{\textbf{odds}_\theta(y_l|x)} \right)
Introduction
The objective function of ORPO (Equation 6 of the paper) consists of two components: 1) the supervised fine-tuning (SFT) loss $\mathcal{L}_{SFT}$; 2) the relative ratio loss $\mathcal{L}_{OR}$. Here $\textbf{odds}_\theta(y|x) = \frac{P_\theta(y|x)}{1 - P_\theta(y|x)}$, where $P_\theta(y|x)$ is the length-normalized likelihood of response $y$ given prompt $x$ under the policy.
Together, $\mathcal{L}_{SFT}$ and $\mathcal{L}_{OR}$, weighted by $\lambda$, adapt the pre-trained language model to the desired domain while disfavoring generations that resemble the rejected responses. A minimal sketch of the objective is given below.
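To make the two terms concrete, here is a minimal PyTorch sketch of the batched ORPO objective under the assumption that per-sequence log-probabilities are length-normalized, as in the paper. The function and variable names (`log_odds`, `orpo_loss`, `chosen_avg_logp`, `lam`) are illustrative, not from the paper or any library.

```python
import torch
import torch.nn.functional as F

def log_odds(avg_logp: torch.Tensor) -> torch.Tensor:
    # log odds_theta(y|x) = log P - log(1 - P), where P = exp(avg_logp) is the
    # length-normalized sequence probability; clamping keeps log1p stable.
    p = torch.exp(avg_logp).clamp(max=1.0 - 1e-6)
    return torch.log(p) - torch.log1p(-p)

def orpo_loss(chosen_avg_logp: torch.Tensor,
              rejected_avg_logp: torch.Tensor,
              chosen_nll: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """L_ORPO = E[ L_SFT + lambda * L_OR ] over a batch of (x, y_w, y_l) triples.

    chosen_avg_logp / rejected_avg_logp: mean token log-probability of the
    chosen (y_w) and rejected (y_l) responses under the current policy.
    chosen_nll: standard SFT negative log-likelihood on the chosen response.
    """
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    l_or = -F.logsigmoid(log_odds_ratio)  # -log sigma(log odds ratio)
    return (chosen_nll + lam * l_or).mean()

# Toy usage with made-up per-sequence statistics.
chosen_avg_logp = torch.tensor([-0.8, -1.1])
rejected_avg_logp = torch.tensor([-1.5, -1.3])
chosen_nll = -chosen_avg_logp  # NLL equals the negative mean log-probability here
print(orpo_loss(chosen_avg_logp, rejected_avg_logp, chosen_nll))
```

Because the odds ratio term only nudges relative likelihoods, a small $\lambda$ (e.g. 0.1) is typically enough to separate chosen from rejected responses without hurting the SFT fit.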
Paper: ORPO: Monolithic Preference Optimization without Reference Model
Hugging Face: ORPO Trainer
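For end-to-end training, the ORPO Trainer in Hugging Face TRL implements this objective directly. The following is a minimal sketch assuming the `trl` `ORPOConfig`/`ORPOTrainer` interface and a public preference dataset with chosen/rejected pairs; the model, dataset, and hyperparameter choices are illustrative, and exact argument names (e.g. `processing_class` vs. `tokenizer`) may differ across `trl` versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Any preference dataset with prompt / chosen / rejected pairs works here.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = ORPOConfig(
    output_dir="orpo-gpt2",
    beta=0.1,  # plays the role of lambda in the ORPO objective
    per_device_train_batch_size=2,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```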