How to identify early defaults in online lending: A cost-sensitive multi-layer learning framework

Abstract

Lenders frequently use credit scoring tools to identify bad borrowers who cannot fully repay their liabilities. This is a classical imbalanced classification problem, in which bad loans account for only a small proportion of all applications. Various machine learning techniques have been applied to default prediction over the past few decades. In this paper, we aim to capture early-defaulting borrowers on an online lending platform, who are likely to be fraudsters, by using a multi-layer structured Gradient Boosted Decision Trees model with Light Gradient Boosting Machines (ML-LightGBM). Because the sample distribution is extremely imbalanced and misclassification is costly, we further apply a cost-sensitive framework to the loss function of the classification models in order to improve predictive accuracy. The empirical results, based on a sample of 1.6 million online loans, show that the proposed cost-sensitive ML-LightGBM algorithm outperforms other predictive models. This suggests that cost-sensitive ML-LightGBM is a promising technique for fraud detection and credit scoring.
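
To illustrate what applying a cost-sensitive framework to a classification loss can look like, the following is a minimal sketch of a class-weighted logistic loss together with the gradient and Hessian that a gradient boosting learner would consume. It is not the paper's formulation: the function names and the cost values `fn_cost` and `fp_cost` are hypothetical and chosen only for illustration, and the abstract does not specify the actual cost structure.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cost_sensitive_logloss(raw_scores, labels, fn_cost=10.0, fp_cost=1.0):
    """Class-weighted logistic loss on raw (pre-sigmoid) scores.

    labels: 1 = default (minority class), 0 = non-default.
    fn_cost / fp_cost: assumed, illustrative misclassification costs for
    missing a defaulter versus rejecting a good borrower.
    Returns (mean weighted loss, per-sample gradients, per-sample Hessians).
    """
    grads, hess = [], []
    total = 0.0
    for z, y in zip(raw_scores, labels):
        p = sigmoid(z)
        w = fn_cost if y == 1 else fp_cost
        # -log(p) = softplus(-z) for y = 1; -log(1 - p) = softplus(z) for y = 0
        total += w * (softplus(-z) if y == 1 else softplus(z))
        grads.append(w * (p - y))        # first derivative w.r.t. z
        hess.append(w * p * (1.0 - p))   # second derivative w.r.t. z
    return total / len(labels), grads, hess


if __name__ == "__main__":
    loss, g, h = cost_sensitive_logloss([0.0, 0.0], [1, 0])
    print(loss, g, h)
```

With a higher cost on the default class, an equally uncertain prediction produces a much larger gradient for a defaulter than for a good borrower, so each boosting round concentrates on the rare class. Gradient boosting libraries that accept custom objectives generally expect this kind of gradient and Hessian pair, although the exact callback signature depends on the library and version.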

Keywords: Ensemble learning, Credit scoring, Imbalanced classification, Early default, Fraud detection

Review history: Received 20 October 2020, Revised 12 March 2021, Accepted 15 March 2021, Available online 18 March 2021, Version of Record 24 March 2021.

Official paper URL: https://doi.org/10.1016/j.knosys.2021.106963