Context-aware incremental clustering of alerts in monitoring systems

Authors:

Highlights:

Abstract

The high complexity of modern hybrid IT applications poses a growing challenge for operations teams that rely on traditional monitoring approaches. In monitoring systems, incidents occur frequently and have a variety of causes, from software and hardware updates to changes in the operating environment. These incidents can significantly degrade system availability and customer satisfaction. In many cases, investigating an incident in such an environment can feel like looking for a needle in a haystack, and you may not even know what the needle looks like until you see it. In that regard, one of the main challenges is how to efficiently analyze, in real time, multiple sets of alert messages stemming from disparate monitoring tools and collectors across the application stack. Such an analysis can provide trustworthy detection of system states at various critical points, thus helping teams to detect, frame, analyze and resolve incidents or failures in a relatively short time, especially when an accurate map of the system’s topological dependencies is unavailable. In this work, we suggest a new approach to determining relations among alerts, thereby forming “events”. The suggested approach directly models the event’s likelihood by first embedding the alerts’ corresponding metrics into a common latent space, where the distance among metrics can be naturally defined, using a word2vec model, and then clustering the alerts with a tailored incremental clustering algorithm. The suggested approach allows controlling the trade-off between the model’s sensitivity and the clusters’ robustness to noise, thus spanning a wide range of clustering mechanisms, and allows adapting the clustering outcome to the level and properties of the noise expected in the input data.
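The two-stage idea in the abstract (embed metrics, then cluster alerts incrementally) can be illustrated with a minimal sketch. It is not the paper's tailored algorithm: it assumes the metric embeddings have already been learned (for example by a skip-gram word2vec model) and substitutes a simple threshold-based, leader-style incremental clustering over cosine similarity. The metric IDs, vectors, class name `IncrementalClusterer` and the `threshold` parameter are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class IncrementalClusterer:
    """Leader-style incremental clustering (an illustrative stand-in,
    not the tailored algorithm described in the paper)."""

    def __init__(self, threshold):
        # Hypothetical knob: a higher threshold yields more, tighter clusters;
        # a lower threshold merges more alerts into the same cluster.
        self.threshold = threshold
        self.centroids = []  # one running-mean vector per cluster
        self.members = []    # alert/metric IDs assigned to each cluster

    def add(self, alert_id, vector):
        """Assign one incoming alert to the most similar cluster, or open a new one."""
        best, best_sim = None, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vector, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            # Update the running mean of the chosen cluster.
            n = len(self.members[best])
            self.centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(self.centroids[best], vector)
            ]
            self.members[best].append(alert_id)
            return best
        # No cluster is similar enough: start a new one.
        self.centroids.append(list(vector))
        self.members.append([alert_id])
        return len(self.centroids) - 1


# Hypothetical 2-D embeddings of metric IDs (assumed to be pre-trained).
embeddings = {
    "cpu.load": [0.9, 0.1],
    "cpu.temp": [0.8, 0.2],
    "disk.io": [0.1, 0.9],
}

clusterer = IncrementalClusterer(threshold=0.9)
labels = [clusterer.add(m, embeddings[m]) for m in ["cpu.load", "disk.io", "cpu.temp"]]
print(labels)  # → [0, 1, 0]
```

In this sketch the single `threshold` value plays the role of the sensitivity versus noise-robustness control mentioned in the abstract; how the paper actually parameterizes that trade-off is not specified here.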

Keywords: Monitoring, Alerts, Metric ID, Embedding, Clustering, Skip-gram, Pair-wise similarity, Negative sampling

Review history: Received 20 June 2021, Revised 31 January 2022, Accepted 7 August 2022, Available online 13 August 2022, Version of Record 24 August 2022.

Official paper URL: https://doi.org/10.1016/j.eswa.2022.118489