A multi-view similarity measure framework for trouble ticket mining

作者:

Highlights:

摘要

Text similarity measures play a very important role in several text mining applications. Although there is an extensive literature on measuring the similarity between long texts, there is less work related to the measurement of similarity between short texts. And most of these works on short text similarity are based on adaptations of long-text similarity methods. Unfortunately, the description of a trouble ticket is just a kind of short texts. Thus, ticket mining applications such as ticket classification, ticket clustering, and ticket resolution recommendation often suffer from poor performance because of tickets’ particular characteristics of unstructured, short free-text with large vocabulary size, large volume, non-English dictionary words, and so on. Therefore, the ability to accurately measure the similarity between two tickets is critical to the performance of ticket mining.To address this performance issue, this paper proposes a multi-view similarity measure framework that easily integrates several kinds of existing similarity measures including surface matching based measures, semantic similarity measures and syntax based measures. Further, in order to make full use of the strengths of different similarity measures, our framework adopts four different policies to combine them. In particular, we consider a machine learning based policy that can be applied to integrate various similarity measures in a more general way, which makes our framework flexible and extensible. To demonstrate the effectiveness of measures generated from our framework, we empirically validate them on a publicly available short text data set and apply them to a real-world ticket data set from a large enterprise IT infrastructure. Some important findings obtained via the result analysis will be helpful to further improve performance.

论文关键词:Trouble ticket,Similarity measure,Semantic similarity,Machine learning

论文评审过程:Received 8 September 2018, Revised 10 November 2019, Accepted 8 February 2020, Available online 12 February 2020, Version of Record 28 May 2020.

论文官网地址:https://doi.org/10.1016/j.datak.2020.101800