Item response theory in AI: Analysing machine learning classifiers at the instance level

Authors:

Abstract

AI systems are usually evaluated on a range of problem instances and compared to other AI systems that use different strategies. These instances are rarely independent. Machine learning, and supervised learning in particular, is a very good example of this. Given a machine learning model, its behaviour on a single instance cannot be understood in isolation, but only in relation to the rest of the data distribution or dataset. Dually, the results of one machine learning model on an instance can be analysed in comparison to other models. While this analysis is relative to a population or distribution of models, it can give much more insight than an isolated analysis. Item response theory (IRT) exploits this duality between items and respondents to extract latent variables of the items (such as discrimination or difficulty) and of the respondents (such as ability). IRT can be adapted to the analysis of machine learning experiments (and, by extension, to any other artificial intelligence experiments). In this paper, we show that IRT suits classification tasks perfectly, where instances correspond to items and classifiers correspond to respondents. We perform a series of experiments with a range of datasets and classification methods to fully understand what the IRT parameters, such as discrimination, difficulty and guessing, mean for classification instances (and how they relate to instance hardness measures), and how the estimated classifier ability can be used to compare classifier performance in a different way, through classifier characteristic curves.
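To make the item-respondent duality in the abstract concrete, the sketch below (not the authors' code) builds a binary response matrix in which rows are classifiers (respondents) and columns are dataset instances (items), and fits the standard three-parameter logistic (3PL) IRT model, P(correct) = c + (1 - c) * sigmoid(a * (theta - b)), where a is discrimination, b is difficulty, c is guessing and theta is classifier ability. The toy data, variable names and the joint maximum-likelihood fit are illustrative assumptions: in practice IRT parameters are usually estimated with marginal ML or Bayesian methods (e.g. the R packages ltm or mirt), and the paper's own estimation setup may differ.

```python
# Minimal 3PL IRT fit over a classifier-by-instance response matrix.
# Rows: classifiers (respondents). Columns: instances (items).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_classifiers, n_instances = 15, 40

# Toy response matrix: 1 if classifier i labelled instance j correctly.
responses = (rng.random((n_classifiers, n_instances)) < 0.7).astype(float)

def unpack(params):
    theta = params[:n_classifiers]                                  # ability
    a = params[n_classifiers:n_classifiers + n_instances]           # discrimination
    b = params[n_classifiers + n_instances:
               n_classifiers + 2 * n_instances]                     # difficulty
    c = 1.0 / (1.0 + np.exp(-params[-n_instances:]))                # guessing in (0, 1)
    return theta, a, b, c

def neg_log_likelihood(params):
    theta, a, b, c = unpack(params)
    # 3PL: P(correct) = c + (1 - c) * sigmoid(a * (theta - b))
    logits = a[None, :] * (theta[:, None] - b[None, :])
    p = c[None, :] + (1.0 - c[None, :]) / (1.0 + np.exp(-logits))
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

x0 = np.concatenate([np.zeros(n_classifiers),      # abilities start at 0
                     np.ones(n_instances),         # discriminations start at 1
                     np.zeros(n_instances),        # difficulties start at 0
                     -2.0 * np.ones(n_instances)]) # guessing starts near 0.12
result = minimize(neg_log_likelihood, x0, method="L-BFGS-B")
theta, a, b, c = unpack(result.x)
print("estimated classifier abilities:", np.round(theta, 2))
```

Plotting each classifier's expected accuracy as a function of item difficulty, given its estimated ability, yields the kind of classifier characteristic curve the abstract refers to.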

Keywords: Artificial intelligence evaluation, Item response theory, Machine learning, Instance hardness, Classifier metrics

Article history: Received 4 March 2017; Revised 13 September 2018; Accepted 20 September 2018; Available online 31 January 2019; Version of Record 6 February 2019.

DOI: https://doi.org/10.1016/j.artint.2018.09.004