An approximate representation of hypercliques

作者:A. Koufakou

摘要

A hyperclique is an itemset containing items that are strongly correlated with each other, based on a user-specified threshold. Hypercliques (HC) have been successfully used in a number of applications, for example clustering and noise removal. Even though the HC collection has been shown to respond well to datasets with skewed support distribution and low support threshold, it may still grow very large for dense datasets and lower h-confidence threshold. Recently, we proposed a condensed representation of HC, the Non-Derivable Hypercliques (NDHC). NDHC was shown to have substantial advantages over HC, especially for dense datasets and lower h-confidence values. In this paper, we propose an approximation of the NDHC collection, called Approximate Non-Derivable Hypercliques (ANDHC). This collection is a subset of NDHC and is generated based on a user-input error. We present an efficient algorithm to mine all ANDHC sets, and an algorithm to approximately derive the HC collection from the ANDHC collection. Through experiments with several datasets and parameter combinations, the ANDHC collection is shown to have significant collection size reduction from NDHC even for small error values. It is also shown that based on ANDHC we can generate a good approximation of the entire HC collection. Finally, we experiment with applying our proposed collections to enhance clustering with encouraging results.

论文关键词:Hyperclique, Non-derivable itemset, Frequent itemset mining, Approximate hyperclique collection

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10844-016-0409-4