Improving knowledge distillation via an expressive teacher

Authors:

Highlights:

Abstract

Knowledge distillation (KD) is a widely used network compression technique that seeks a lightweight student network whose behavior is similar to that of its heavy teacher network. Previous studies mainly focus on training the student to mimic the representation space of the teacher; however, how to be a good teacher is rarely explored. We find that if a teacher has a weak ability to capture the knowledge underlying the true data in the real world, the student cannot learn that knowledge from its teacher either. Motivated by this, we propose an inter-class correlation regularization that trains the teacher to capture a more explicit correlation among classes. In addition, we enforce the student to mimic the inter-class correlation of its teacher. Extensive experiments on image classification have been conducted on four public benchmarks. For example, when the teacher and student networks are ShuffleNetV2-1.0 and ShuffleNetV2-0.5, our proposed method achieves a 42.63% top-1 error rate on Tiny ImageNet.
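As a concrete illustration of the idea described above, the sketch below combines standard knowledge distillation with an inter-class correlation matching term. The abstract does not give the exact form of the regularization, so the correlation matrix here is assumed, for illustration only, to be the cosine similarity between per-class mean softened predictions within a batch; the function names and hyperparameters (alpha, beta, temperature) are likewise illustrative, not the paper's actual formulation.

```python
# Minimal sketch: vanilla KD loss plus an assumed inter-class correlation
# matching term that makes the student mimic the teacher's class correlations.
import torch
import torch.nn.functional as F


def class_correlation(logits, labels, num_classes, temperature=4.0):
    """Assumed form: cosine-similarity matrix between per-class mean
    temperature-softened probabilities computed over the current batch."""
    probs = F.softmax(logits / temperature, dim=1)
    means = torch.zeros(num_classes, probs.size(1), device=logits.device)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            means[c] = probs[mask].mean(dim=0)
    means = F.normalize(means, dim=1)   # unit-norm rows
    return means @ means.t()            # (num_classes, num_classes) correlations


def distillation_loss(student_logits, teacher_logits, labels,
                      num_classes, alpha=0.5, beta=1.0, temperature=4.0):
    """Cross-entropy + soft-label KD + inter-class correlation matching."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    corr_s = class_correlation(student_logits, labels, num_classes, temperature)
    corr_t = class_correlation(teacher_logits, labels, num_classes, temperature)
    icc = F.mse_loss(corr_s, corr_t.detach())  # student mimics teacher correlations
    return ce + alpha * kd + beta * icc
```

A regularizer of the same correlation-matching form could also be added to the teacher's own training objective, which is how the abstract describes obtaining a more "expressive" teacher.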

Keywords: Neural network compression, Knowledge distillation, Knowledge transfer

Article history: Received 15 October 2020, Revised 8 January 2021, Accepted 3 February 2021, Available online 12 February 2021, Version of Record 20 February 2021.

DOI: https://doi.org/10.1016/j.knosys.2021.106837