Generalized Knowledge Distillation (GKD)

Tags: #AI #nlp #llm #RLHF

Equation

$$L_\mathrm{GKD}(\theta) := (1 - \lambda) \mathbb{E}_{(x, y) \sim (X, Y)} \big[ \mathcal{D}(p_{T} \| p_{S}^\theta)(y|x) \big] + \lambda \mathbb{E}_{x\sim X} \Big[\mathbb{E}_{y \sim p_{S} (\cdot|x)} \big[\mathcal{D}(p_{T} \| p_{S}^\theta)(y|x)\big]\Big] $$

Latex Code

    L_\mathrm{GKD}(\theta) := (1 - \lambda) \mathbb{E}_{(x, y) \sim (X, Y)} \big[ \mathcal{D}(p_{T} \| p_{S}^\theta)(y|x) \big] + \lambda \mathbb{E}_{x\sim X} \Big[\mathbb{E}_{y \sim p_{S} (\cdot|x)} \big[\mathcal{D}(p_{T} \| p_{S}^\theta)(y|x)\big]\Big]
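For reference, the divergence term $\mathcal{D}(p_T \| p_S^\theta)(y|x)$ in the objective above is a sequence-level quantity built from token-level divergences. In the GKD paper it is the per-token divergence between the teacher and student next-token distributions, averaged over the output sequence; a sketch of that definition, using the same notation as the equation above:

$$\mathcal{D}(p_T \| p_S^\theta)(y|x) := \frac{1}{L_y} \sum_{n=1}^{L_y} D\big(p_T(\cdot \mid y_{<n}, x) \,\big\|\, p_S^\theta(\cdot \mid y_{<n}, x)\big)$$

where $L_y$ is the length of the output sequence $y$ and $D$ is any token-level divergence, such as the forward KL or a generalized Jensen-Shannon divergence.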

Introduction

We unify the supervised and on-policy approaches and propose a more general approach, which we call Generalized KD (**GKD**). In GKD, we can choose both the divergence to optimize as well as the output sequences to train on. Specifically, we can optimize any divergence between the teacher and student token-level probability distributions. For output sequences, GKD uses a mixture of a fixed dataset, either teacher-generated or ground-truth, and on-policy student-generated sequences. Abstractly, GKD minimizes an objective of the form shown above, where $\mathcal{D}(p_{T} \| p_{S}^\theta)(y|x)$ is a divergence between the teacher and student distributions, and $\lambda \in [0, 1]$ is a hyper-parameter that controls the *student data fraction*, that is, the fraction of on-policy student-generated outputs. Akin to on-policy KD, we do not backpropagate gradients through the student's sampling process. On-policy and supervised KD are instantiations of GKD with the divergence $\mathcal{D}$ set to forward KL and the student data fraction $\lambda$ set to $1$ and $0$, respectively. That said, GKD allows for other choices of the fraction $\lambda$ and the divergence, which we explore in this work.
Paper: On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
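To make the update rule concrete, the sketch below shows one GKD step in PyTorch with the forward KL as the divergence $\mathcal{D}$, assuming Hugging Face-style `student` and `teacher` causal language models that share a tokenizer. The function name `gkd_step`, the argument `fixed_outputs`, and the generation settings are illustrative, and details such as prompt-token masking, attention masks, and the optimizer step are left out.

    # Minimal single-update sketch of GKD with forward KL as the divergence.
    # Assumes Hugging Face-style causal LMs for `student` and `teacher`; names
    # and hyper-parameters here are illustrative.
    import torch
    import torch.nn.functional as F

    def gkd_step(student, teacher, tokenizer, prompts, fixed_outputs, lam=0.5):
        # Pick the source of output sequences: with probability `lam` use on-policy
        # student samples, otherwise use the fixed dataset (teacher-generated or
        # ground-truth). Sampling per batch realizes the lambda-mixture in expectation.
        if torch.rand(()).item() < lam:
            enc = tokenizer(prompts, return_tensors="pt", padding=True)
            with torch.no_grad():  # no gradients through the student's sampling process
                sequences = student.generate(**enc, do_sample=True, max_new_tokens=128)
        else:
            sequences = tokenizer(
                [p + o for p, o in zip(prompts, fixed_outputs)],
                return_tensors="pt", padding=True,
            ).input_ids

        # Token-level next-token distributions of teacher and student on the same sequences.
        with torch.no_grad():
            teacher_logits = teacher(sequences).logits
        student_logits = student(sequences).logits

        # Forward KL D(p_T || p_S^theta) per token, averaged over all tokens in the batch.
        p_t = F.softmax(teacher_logits, dim=-1)
        loss = (p_t * (F.log_softmax(teacher_logits, dim=-1)
                       - F.log_softmax(student_logits, dim=-1))).sum(-1).mean()
        loss.backward()  # caller applies optimizer.step() / zero_grad()
        return loss.detach()

With `lam=0` this reduces to supervised KD on the fixed dataset, and with `lam=1` to purely on-policy distillation on student-generated samples, matching the two special cases discussed above.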
