Learning credible DNNs via incorporating prior knowledge and model local explanation

Abstract

Recent studies have shown that state-of-the-art DNNs are not always credible, despite their impressive performance on the hold-out test sets of a variety of tasks. These models tend to exploit dataset shortcuts to make predictions rather than learn the underlying task. This lack of credibility can lead to poor generalization, adversarial vulnerability, and algorithmic discrimination. In this paper, we propose CREX, a method for training more credible DNNs. The high-level idea of CREX is to encourage DNN models to focus on the evidence that actually matters for the task at hand and to avoid overfitting to data-dependent shortcuts. Specifically, during DNN training, CREX directly regularizes the local explanation with expert rationales, i.e., subsets of features highlighted by domain experts as justifications for predictions, to enforce alignment between local explanations and rationales. Even when rationales are not available, CREX can still be useful by requiring the generated explanations to be sparse. In addition, CREX is widely applicable to different network architectures, including CNNs, LSTMs, and attention models. Experimental results on several text classification datasets demonstrate that CREX can increase the credibility of DNNs. Comprehensive analysis further shows three meaningful improvements from CREX: (1) it significantly increases DNN accuracy on new and previously unseen data beyond the test set, (2) it enhances the fairness of DNNs in terms of the equality of opportunity metric and reduces models’ discrimination against certain demographic groups, and (3) it improves the robustness of DNN models to adversarial attacks. These experimental results highlight the advantages of the increased credibility provided by CREX.
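
The abstract does not give the exact form of the regularizer, so the following is only a minimal sketch of the general idea, not the CREX formulation itself. It assumes a linear logistic scorer, uses gradient-times-input as the local explanation, and penalizes attribution that falls outside the expert rationale; when no rationale is given, it falls back to an L1 sparsity penalty on the explanation. All function names, the weight `lam`, and the choice of attribution method are illustrative assumptions.

```python
import math

def local_explanation(weights, x):
    # Assumed attribution method: gradient-times-input.
    # For a linear scorer z = sum(w_i * x_i), this is simply w_i * x_i.
    return [w * xi for w, xi in zip(weights, x)]

def explanation_penalty(explanation, rationale_mask=None):
    # Without a rationale: L1 norm of the explanation (encourages sparsity).
    if rationale_mask is None:
        return sum(abs(e) for e in explanation)
    # With a rationale (1 = feature highlighted by the expert):
    # penalize only the attribution assigned to non-rationale features.
    return sum(abs(e) for e, m in zip(explanation, rationale_mask) if m == 0)

def cross_entropy(weights, x, y):
    # Binary cross-entropy of a logistic model on one example.
    z = sum(w * xi for w, xi in zip(weights, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def regularized_loss(weights, x, y, rationale_mask=None, lam=0.1):
    # Task loss plus a weighted explanation penalty (lam is a hypothetical
    # hyperparameter, not a value taken from the paper).
    penalty = explanation_penalty(local_explanation(weights, x), rationale_mask)
    return cross_entropy(weights, x, y) + lam * penalty

# Toy example: three features, the expert marks only the first as relevant.
weights = [2.0, -1.0, 0.5]
x = [1.0, 1.0, 2.0]
mask = [1, 0, 0]
loss_with_rationale = regularized_loss(weights, x, 1, rationale_mask=mask)
loss_sparse_only = regularized_loss(weights, x, 1)
```

In this toy setting, minimizing the regularized loss pushes the weights on the second and third features toward zero when the rationale is supplied, so the prediction relies on the feature the expert highlighted. A deep model would need a differentiable attribution method in place of the closed-form `w_i * x_i` used here.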