Energy-based anomaly detection for mixed data

作者:Kien Do, Truyen Tran, Svetha Venkatesh

摘要

Anomalies are those deviating significantly from the norm. Thus, anomaly detection amounts to finding data points located far away from their neighbors, i.e., those lying in low-density regions. Classic anomaly detection methods are largely designed for single data type such as continuous or discrete. However, real-world data is increasingly heterogeneous, where a data point can have both discrete and continuous attributes. Mixed data poses multiple challenges including (a) capturing the inter-type correlation structures and (b) measuring deviation from the norm under multiple types. These challenges are exaggerated under (c) high-dimensional regimes. In this paper, we propose a new scalable unsupervised anomaly detection method for mixed data based on Mixed-variate Restricted Boltzmann Machine (Mv.RBM). The Mv.RBM is a principled probabilistic method that estimates density of mixed data. We propose to use free energy derived from Mv.RBM as anomaly score as it is identical to data negative log-density up to an additive constant. We then extend this method to detect anomalies across multiple levels of data abstraction, an effective approach to deal with high-dimensional settings. The extension is dubbed \(\mathtt {MIXMAD}\), which stands for MIXed data Multilevel Anomaly Detection. In \(\mathtt {MIXMAD}\), we sequentially construct an ensemble of mixed-data Deep Belief Nets (DBNs) with varying depths. Each DBN is an energy-based detector at a predefined abstraction level. Predictions across the ensemble are finally combined via a simple rank aggregation method. The proposed methods are evaluated on a comprehensive suit of synthetic and real high-dimensional datasets. The results demonstrate that for anomaly detection, (a) a proper handling of mixed types is necessary, (b) free energy is a powerful anomaly scoring method, (c) multilevel abstraction of data is important for high-dimensional data, and (d) empirically Mv.RBM and \(\mathtt {MIXMAD}\) are superior to popular unsupervised detection methods for both homogeneous and mixed data.

论文关键词:Mixed data, Mixed-variate restricted Boltzmann machine, Deep belief net, Multilevel anomaly detection

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-018-1168-z