Lightweight surrogate random forest support for model simplification and feature relevance

Abstract

In this study, we propose a lightweight surrogate random forest (L-SRF) algorithm that can be interpreted through a new rule distillation method. Common surrogate models replace an existing heavy, deep, but high-performance black-box model using a teacher–student learning framework. However, a student model obtained in this way must maintain the performance of the teacher model, so the achievable degree of model simplification and transparency is extremely limited. To increase model transparency while maintaining the performance of the surrogate model, we propose two methods. First, we propose a cross-entropy Shapley value to evaluate the contribution of each rule in the student surrogate model. Second, we devise a random mini-grouping method that effectively distills away less important rules while minimizing the overfitting caused by model simplification. Because it is based on rule contributions, the proposed L-SRF achieves a large distillation ratio relative to the initial SRF model, which improves the simplification and transparency of the model. In addition, because the proposed L-SRF removes only unnecessary rules, it minimizes the loss of the importance and relevance of each feature. To evaluate the proposed L-SRF method, several comparative experiments were conducted on various data sets. The experiments show that the proposed method outperforms black-box AI models in terms of model transparency and memory requirements, as well as in the interpretation of feature relevance.