RLHF: Reinforcement Learning from Human Feedback
Tags: #nlp #LLM #equation

Equation
$$ p^*(y_w \succ y_l|x) = \sigma(r^*(x,y_w) - r^*(x,y_l)) $$

$$ \mathcal{L}_R(r_\phi) = \mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}}[- \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))] $$

$$ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)} [r_\phi(x,y)] - \beta\, D_{\text{KL}}(\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)) $$
Introduction
We are given a dataset $$\mathcal{D}$$ of preferences $$(x, y_w, y_l)$$, where $$x$$ is an input, $$y_w$$ and $$y_l$$ are the preferred and dispreferred outputs (i.e., $$y_w \succ y_l$$ for $$x$$), and $$r^*$$ is the "true" reward function underlying the preferences. Because querying a human for the true reward would be intractably expensive, a reward model $$r_\phi(\cdot)$$ is trained as a proxy.
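A minimal sketch of the reward-model loss $$\mathcal{L}_R$$ above in PyTorch, assuming the scalar rewards $$r_\phi(x, y_w)$$ and $$r_\phi(x, y_l)$$ have already been computed for a batch of preference pairs (the toy tensors below just stand in for real model outputs):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the preferences: -log sigma(r_phi(x, y_w) - r_phi(x, y_l)).

    r_w, r_l: (batch,) scalar rewards assigned by r_phi to the preferred and
    dispreferred outputs for the same prompts x.
    """
    return -F.logsigmoid(r_w - r_l).mean()

# Toy usage: random scores standing in for a real reward model's outputs.
r_w = torch.randn(16)
r_l = torch.randn(16)
print(reward_model_loss(r_w, r_l))
```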
$$\mathcal{D}$$: Dataset of human preference data $$(x, y_w, y_l)$$
$$x$$: Input text
$$y_w$$: Preferred output
$$y_l$$: Dispreferred output
$$r^*$$: True reward function underlying the preferences
$$ \mathcal{L}_R(r_\phi) $$: Reward-model loss, the negative log-likelihood of the human preference data, minimized with respect to $$r_\phi$$
$$ \beta $$: Hyperparameter controlling the strength of the KL penalty
$$ D_{\text{KL}}(\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x)) $$: KL-divergence penalty that restricts how far the language model $$ \pi_{\theta} $$ can drift from the reference model $$ \pi_{\text{ref}} $$
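The last equation is the objective maximized during the RL stage. Below is a minimal sketch of a Monte-Carlo estimate of it in PyTorch, assuming the per-sequence rewards and log-probabilities under $$\pi_\theta$$ and $$\pi_{\text{ref}}$$ have already been computed for responses sampled from $$\pi_\theta$$ (the `beta` default is only illustrative); in practice this objective is maximized with a policy-gradient method such as PPO rather than by direct gradient ascent on this estimate.

```python
import torch

def rlhf_objective(rewards: torch.Tensor,
                   logp_policy: torch.Tensor,
                   logp_ref: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Estimate E[r_phi(x, y)] - beta * D_KL(pi_theta(y|x) || pi_ref(y|x)).

    rewards:     (batch,) reward-model scores r_phi(x, y) for sampled responses y.
    logp_policy: (batch,) log pi_theta(y|x) of those responses.
    logp_ref:    (batch,) log pi_ref(y|x) of the same responses.
    beta:        KL penalty coefficient (illustrative value, not from the source).
    """
    # Since y ~ pi_theta, the mean log-ratio is a Monte-Carlo estimate of the KL term.
    kl_estimate = logp_policy - logp_ref
    return (rewards - beta * kl_estimate).mean()

# Toy usage: random values standing in for model outputs.
rewards = torch.randn(8)
logp_policy = torch.randn(8)
logp_ref = torch.randn(8)
print(rlhf_objective(rewards, logp_policy, logp_ref))
```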