End-to-End Supermask Pruning: Learning to Prune Image Captioning Models

Authors:

Highlights:

• This is the first extensive attempt at exploring model pruning for the image captioning task. Empirically, we show that deep captioning networks at 80% to 95% sparsity can match or even slightly outperform their dense counterparts. In addition, we propose a pruning method - Supermask Pruning (SMP) - that performs continuous and gradual sparsification during the training stage, based on parameter sensitivity, in an end-to-end fashion.

• We investigate the ideal way to combine pruning with fine-tuning of a pre-trained CNN, and show that both decoder pruning and decoder training should be done before pruning the encoder.

• We release pre-trained sparse models for UD and ORT that achieve CIDEr scores >120 on the MS-COCO dataset, yet are only 8.7 MB (a 96% reduction compared to dense UD) and 14.5 MB (a 94% reduction compared to dense ORT) in model size. Our code and pre-trained models are publicly available at https://github.com/jiahuei/sparse-image-captioning
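The SMP method itself is detailed in the paper; as a rough illustration of supermask-style end-to-end pruning, here is a minimal NumPy sketch in which each weight carries a learnable mask score and gradients reach the scores through a straight-through estimator. The class name, the sigmoid-threshold mask, and the straight-through approximation are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SupermaskLayer:
    """Toy dense layer whose weights are gated by a learnable binary mask.

    Hypothetical sketch: each weight w_ij has a real-valued score s_ij; the
    mask is 1 where sigmoid(s_ij) > 0.5. Gradients flow to the scores via a
    straight-through estimator, so mask and weights are trained jointly
    (end-to-end), in the spirit of supermask-style pruning.
    """

    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0.0, 0.1, (in_dim, out_dim))
        self.s = rng.normal(0.0, 0.1, (in_dim, out_dim))  # mask scores

    def mask(self):
        # Hard binary mask derived from the continuous scores.
        return (sigmoid(self.s) > 0.5).astype(self.w.dtype)

    def forward(self, x):
        self.x = x
        return x @ (self.w * self.mask())

    def backward(self, grad_out, lr=0.1):
        m = self.mask()
        grad_in = grad_out @ (self.w * m).T          # gradient w.r.t. input
        grad_w = (self.x.T @ grad_out) * m           # only unmasked weights update
        # Straight-through estimator: pretend d(mask)/d(s) = sigmoid'(s),
        # so pruning decisions receive a training signal too.
        sig = sigmoid(self.s)
        grad_s = (self.x.T @ grad_out) * self.w * sig * (1.0 - sig)
        self.w -= lr * grad_w
        self.s -= lr * grad_s
        return grad_in

def sparsity(layer):
    """Fraction of weights currently masked out (0 = dense, 1 = all pruned)."""
    return 1.0 - layer.mask().mean()
```

In a full training setup, the score updates would be combined with a gradual sparsity target so the network moves smoothly from dense to 80-95% sparse over the course of training, rather than being pruned in one shot.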


Keywords: Image captioning, Deep network compression, Deep learning

Article history: Received 13 April 2021, Revised 23 August 2021, Accepted 4 October 2021, Available online 5 October 2021, Version of Record 11 October 2021.

DOI: https://doi.org/10.1016/j.patcog.2021.108366