A new approach for generating efficient sample from market basket data

摘要

Classical data mining algorithms require expensive passes over the entire database to generate frequent items and hence to generate association rules. With the increase in the size of database, it is becoming very difficult to handle large amount of data for computation. One of the solutions to this problem is to generate sample from the database that acts as representative of the entire database for finding association rules in such a way that the distance of the sample from the complete database is minimal. Choosing correct sample that could represent data is not an easy task. Many algorithms have been proposed in the past. Some of them are computationally fast while others give better accuracy. In this paper, we present an algorithm for generating a sample from the database that can replace the entire database for generating association rules and is aimed at keeping a balance between accuracy and speed. The algorithm that is proposed takes into account the average number of small, medium and large 1-itemset in the database and average weight of the transactions to define threshold condition for the transactions. Set of transactions that satisfy the threshold condition is chosen as the representative for the entire database. The effectiveness of the proposed algorithm has been tested over several runs of database generated by IBM synthetic data generator. A vivid comparative performance evaluation of the proposed technique with the existing sampling techniques for comparing the accuracy and speed has also been carried out.