Improving parallelism of federated query processing

Abstract

Many large enterprises require access to distributed databases for business intelligence (BI) applications. Typically, distributed databases are integrated into a centralized data warehouse for ease of maintenance. However, this approach must overcome the complexity of data loading and job scheduling, as well as scalability issues. On the other hand, a fully federated system may not be feasible for data-intensive BI applications. A hybrid approach based on intelligent data placement is more flexible and more widely applicable than either a centralized or a fully federated configuration. The current implementation of the hybrid approach aggregates selected data from various remote sources as materialized views and caches them at the federation server to improve the performance of complex BI query workloads. In this paper, we propose an improvement that, in conjunction with the current hybrid approach to data placement, also recommends Materialized Query Tables (MQTs) for the backend servers, for the benefit of load distribution and easier maintenance of aggregated data. Our approach considers the correlation between backend servers and recommends MQTs that are well coordinated among the backend servers and optimized for the workload. We also exploit parallelism among the backend servers so that the running time of our approach grows almost linearly, rather than exponentially, with the number of backend servers, without sacrificing recommendation quality. Experimental evaluations validate the effectiveness and efficiency of our approach.