On anonymizing transactions with sensitive items

Authors: Shyue-Liang Wang, Yu-Chuan Tsai, Hung-Yu Kao, Tzung-Pei Hong

Abstract

K-anonymity (Samarati and Sweeney 1998; Samarati, IEEE Trans Knowl Data Eng, 13(6):1010–1027, 2001; Sweeney, Int J Uncertain, Fuzziness Knowl-Based Syst, 10(5):557–570, 2002) and its variants, l-diversity (Machanavajjhala et al., ACM TKDD, 2007) and t-closeness (Li et al. 2007) among others, are anonymization techniques for relational data and transaction data that are used to protect privacy against re-identification attacks. A relational dataset D is k-anonymous if every record in D has at least k-1 other records with identical quasi-identifier attribute values. The combination of released data with external data will never allow the recipient to associate each released record with fewer than k individuals (Samarati, IEEE Trans Knowl Data Eng, 13(6):1010–1027, 2001). However, the current concept of k-anonymity on transaction data treats all items as quasi-identifiers. The anonymized data set has k identical transactions in groups and suffers from lower data utility (He and Naughton 2009; He et al. 2011; Liu and Wang 2010; Terrovitis et al., VLDB J, 20(1):83–106, 2011; Terrovitis et al. 2008). To improve the utility of anonymized transaction data, this work proposes a novel anonymity concept on transaction data that contain both quasi-identifier items (QID) and sensitive items (SI). A transaction that contains sensitive items must have at least k-1 other identical transactions (Ghinita et al. IEEE TKDE, 33(2):161–174, 2011; Xu et al. 2008). For a transaction that does not contain a sensitive item, no anonymization is required. A transaction dataset that satisfies this property is said to be sensitive k-anonymous. Three algorithms, Sensitive Transaction Neighbors (STN), Gray Sort Clustering (GSC) and Nearest Neighbors for K-anonymization (K-NN), are developed. These algorithms add or delete QID items and only add SI to achieve sensitive k-anonymity on transaction data.
Additionally, a simple “privacy value” is proposed to evaluate the degree of privacy for different types of k-anonymity on transaction data. Extensive numerical simulations were carried out to demonstrate the characteristics of the proposed algorithms and to compare them with other types of k-anonymity approaches. The results show that each technique possesses its own advantage under different criteria such as running time, operation, and information loss. The results obtained here can be used as a guideline for selecting an anonymization technique for different data sets and different applications.
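The sensitive k-anonymity property defined in the abstract can be checked mechanically and contrasted with classic transaction k-anonymity. The following is an added example, not code from the paper or its authors; the function names and item labels are hypothetical, the sample transactions are constructed, and "identical" is assumed to mean equality of the full item set (QID and SI together).

```python
from collections import Counter

def violations(transactions, k, sensitive=None):
    """Return {itemset: count} for equivalence groups that need protection but have < k members.

    sensitive=None: classic transaction k-anonymity (every item is a quasi-identifier,
                    so every transaction needs k-1 identical copies).
    sensitive=set : sensitive k-anonymity (only transactions holding at least one
                    sensitive item need k-1 identical copies).
    Assumption: two transactions are "identical" when their full item sets are equal.
    """
    sens = None if sensitive is None else frozenset(sensitive)
    groups = Counter(frozenset(t) for t in transactions)
    bad = {}
    for items, count in groups.items():
        must_protect = sens is None or bool(items & sens)
        if must_protect and count < k:
            bad[items] = count
    return bad

def is_sensitive_k_anonymous(transactions, k, sensitive):
    return not violations(transactions, k, sensitive)

def is_k_anonymous(transactions, k):
    return not violations(transactions, k)

# Constructed sample data: "hiv_test" is the only sensitive item (SI).
SENSITIVE = {"hiv_test"}
RAW = [
    {"bread", "milk", "hiv_test"},
    {"bread", "hiv_test"},
    {"bread", "milk"},
    {"beer"},
]
# A release in which one QID item ("milk") was added to the second transaction.
RELEASED = [
    {"bread", "milk", "hiv_test"},
    {"bread", "milk", "hiv_test"},
    {"bread", "milk"},
    {"beer"},
]

if __name__ == "__main__":
    print("raw, sensitive 2-anonymous:     ", is_sensitive_k_anonymous(RAW, 2, SENSITIVE))
    print("released, sensitive 2-anonymous:", is_sensitive_k_anonymous(RELEASED, 2, SENSITIVE))
    print("released, classic 2-anonymous:  ", is_k_anonymous(RELEASED, 2))
    print("classic violations in release:  ", violations(RELEASED, 2))
```

The released set passes the sensitive check while the two transactions without sensitive items stay unmodified, which is the utility gain the abstract claims over treating every item as a quasi-identifier.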

Keywords: Anonymization, Sensitive k-anonymity, Privacy preservation, Transaction data

Paper URL: https://doi.org/10.1007/s10489-014-0554-9