RLHF: Reinforcement Learning from Human Feedback

Tags: #nlp #LLM #equation

Equation

$$p^*(y_w \succ y_l \mid x) = \sigma(r^*(x, y_w) - r^*(x, y_l))$$

$$\mathcal{L}_R(r_\phi) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[-\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]$$

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)}\left[r_\phi(x, y)\right] - \beta\, D_{\text{KL}}(\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x))$$

Latex Code

p^*(y_w \succ y_l \mid x) = \sigma(r^*(x, y_w) - r^*(x, y_l))

\mathcal{L}_R(r_\phi) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[-\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]

\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)}\left[r_\phi(x, y)\right] - \beta\, D_{\text{KL}}(\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x))


Introduction

Given a dataset $$\mathcal{D}$$ of preferences $$(x, y_w, y_l)$$, where $$x$$ is an input and $$y_w, y_l$$ are the preferred and dispreferred outputs (i.e., $$y_w \succ y_l$$ for $$x$$), let $$r^*$$ denote the "true" reward function underlying the preferences. Querying a human for the true reward would be intractably expensive, so a reward model $$r_\phi(\cdot)$$ is trained as a proxy.

$$\mathcal{D}$$: Dataset of human preference data $$(x, y_w, y_l)$$
$$x$$: Input text
$$y_w$$: Preferred output
$$y_l$$: Dispreferred output
$$r^*$$: True reward function underlying the preferences
$$\mathcal{L}_R(r_\phi)$$: Reward-model loss, the negative log-likelihood of the human preference data
$$\beta$$: Hyperparameter controlling the strength of the KL penalty
$$D_{\text{KL}}(\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x))$$: KL-divergence penalty that restricts how far the language model $$\pi_\theta$$ can drift from the reference model $$\pi_{\text{ref}}$$
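To make the symbols concrete, below is a minimal PyTorch sketch of the two training signals above: the preference-based reward-model loss and a Monte Carlo estimate of the KL-penalized objective. The function names, tensor shapes, the default value of beta, and the per-sequence KL estimate are illustrative assumptions, not code from any referenced paper.

import torch
import torch.nn.functional as F

def reward_model_loss(r_w, r_l):
    # L_R(r_phi) = E[-log sigma(r_phi(x, y_w) - r_phi(x, y_l))]
    # r_w, r_l: reward-model scores for the preferred / dispreferred outputs, shape (batch,)
    return -F.logsigmoid(r_w - r_l).mean()

def kl_penalized_objective(reward, logp_policy, logp_ref, beta=0.1):
    # reward:      r_phi(x, y) for responses y sampled from pi_theta(.|x), shape (batch,)
    # logp_policy: log pi_theta(y|x), summed over response tokens, shape (batch,)
    # logp_ref:    log pi_ref(y|x), summed over response tokens, shape (batch,)
    # Single-sample estimate of D_KL(pi_theta || pi_ref) = E_y[log pi_theta(y|x) - log pi_ref(y|x)]
    kl_estimate = logp_policy - logp_ref
    # Objective to maximize: E[r_phi(x, y)] - beta * D_KL(...)
    return (reward - beta * kl_estimate).mean()

In practice, RLHF implementations usually fold the KL term into a per-token reward inside a PPO-style update rather than computing it once per sequence; the sketch only shows how the symbols in the equations map onto tensors.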

Related

Training language models to follow instructions with human feedback
