Adversarial example detection based on saliency map features

Authors: Shen Wang, Yuxin Gong

Abstract

In recent years, machine learning has greatly improved image recognition capability. However, studies have shown that neural network models are vulnerable to adversarial examples, which cause models to output wrong answers with high confidence. To understand these vulnerabilities, we use interpretability methods to reveal the internal decision-making behavior of models. The interpretation results show that the non-normalized saliency maps of clean and adversarial examples become increasingly differentiated along the model's hidden layers. Exploiting this phenomenon, we propose an adversarial example detection method based on multilayer saliency features, which comprehensively captures the abnormal characteristics of adversarial example interpretations. Experimental results show that the proposed method effectively detects adversarial examples generated by gradient-based, optimization-based, and black-box attacks, and is comparable to state-of-the-art methods.
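To make the idea concrete, the sketch below illustrates one way multilayer saliency features could be extracted in PyTorch: gradients of the predicted-class score with respect to hidden-layer activations serve as per-layer, non-normalized saliency maps, which are summarized into a feature vector for a downstream detector. The backbone (resnet18), the chosen layers, and the summary statistics are illustrative assumptions, not the authors' implementation.

```python
import torch
from torchvision import models

# Hypothetical sketch of multilayer saliency feature extraction.
# Backbone, layer names, and feature statistics are assumptions for
# illustration only, not the paper's exact pipeline.

def multilayer_saliency_features(model, x, layer_names):
    """Summarize |d score / d activation| at selected hidden layers."""
    activations = {}
    handles = []

    def make_hook(name):
        def hook(module, inp, out):
            out.retain_grad()           # keep gradients on intermediate tensors
            activations[name] = out
        return hook

    for name, module in model.named_modules():
        if name in layer_names:
            handles.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    logits = model(x)
    score = logits.max(dim=1).values.sum()   # predicted-class score
    model.zero_grad()
    score.backward()

    feats = []
    for name in layer_names:
        g = activations[name].grad.abs()     # non-normalized saliency map
        # Summarize each layer's saliency map by simple statistics.
        feats.append(torch.stack([g.mean(), g.std(), g.max()]))
    for h in handles:
        h.remove()
    return torch.cat(feats)

model = models.resnet18(weights=None)
x = torch.randn(1, 3, 224, 224)
features = multilayer_saliency_features(
    model, x, ["layer1", "layer2", "layer3", "layer4"]
)
print(features.shape)  # one per-image feature vector for a detector
```

In practice, such feature vectors computed for clean and adversarial images would be fed to a simple binary classifier (e.g., logistic regression) trained to separate the two distributions.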

Keywords: Machine learning, Adversarial example detection, Interpretability, Saliency map

DOI: https://doi.org/10.1007/s10489-021-02759-8