A statistical perspective on distillation

Aditya K Menon, Ankit Singh Rawat, Sashank Reddi, Seungyeon Kim, Sanjiv Kumar

Knowledge distillation is a technique for improving a "student" model by replacing its one-hot training labels with a label distribution obtained from a "teacher" model. Despite its broad success, several basic questions (e.g., why does distillation help? why do more accurate teachers not necessarily distill better?) have received limited formal study. In this paper, we present a statistical perspective on distillation which provides an answer to these questions. Our core observation is that a "Bayes teacher" providing the true class-probabilities can lower the variance of the student objective, and thus improve performance. We then establish a bias-variance tradeoff that quantifies the value of teachers that approximate the Bayes class-probabilities. This provides a formal criterion as to what constitutes a "good" teacher, namely, the quality of its probability estimates. Finally, we illustrate how our statistical perspective facilitates novel applications of distillation to bipartite ranking and multiclass retrieval.
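The following is a minimal Monte Carlo sketch, not taken from the paper, illustrating the variance claim for a single input: against one-hot labels sampled from the true class-probabilities p*, the cross-entropy of a fixed student prediction q is a random quantity, whereas the "Bayes teacher" loss that weights -log q by p* has the same expectation but no label-sampling variance. The distribution p* and prediction q below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_samples = 5, 100_000

# Hypothetical Bayes class-probabilities p* for a single input x.
p_star = rng.dirichlet(np.ones(num_classes))

# A fixed student prediction q(x), here the softmax of random logits.
logits = rng.normal(size=num_classes)
q = np.exp(logits) / np.exp(logits).sum()
log_q = np.log(q)

# Hard-label cross-entropy: loss = -log q_y with labels y ~ p*.
y = rng.choice(num_classes, size=num_samples, p=p_star)
hard_losses = -log_q[y]

# Bayes-teacher cross-entropy: loss = -sum_k p*_k log q_k (no label sampling).
soft_loss = -(p_star * log_q).sum()

print(f"hard-label loss:   mean={hard_losses.mean():.4f}, var={hard_losses.var():.4f}")
print(f"Bayes-teacher loss: value={soft_loss:.4f} (deterministic given p*)")
```

As expected, the average hard-label loss matches the Bayes-teacher loss, but only the former fluctuates across label draws; averaging such per-example losses over a training set is what drives the variance reduction discussed in the abstract.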