Parallel meta-blocking for scaling entity resolution over big heterogeneous data
Authors:
Highlights:
• We adapt Meta-blocking to the MapReduce paradigm through three alternative parallelization strategies: an edge-based strategy that explicitly builds the blocking graph; a comparison-based strategy that uses the blocking graph only implicitly, as a conceptual model; and an entity-based strategy that is independent of the blocking graph. We also provide concrete implementations for all weighting schemes used in Meta-blocking.
• We present a load balancing technique that deals with skewness in the input block collection, splitting it into partitions of the same computational cost.
• We verify the scalability of our techniques through a thorough experimental evaluation over the four largest real datasets to which Meta-blocking has been applied. The data and the implementation of our techniques are publicly available.
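The highlights name two ideas that a small, single-machine sketch may make more concrete: explicitly building a blocking graph, and splitting a skewed block collection into partitions of similar computational cost. The Python below is an illustration only and is not the authors' MapReduce implementation. It assumes that an edge weight is simply the number of blocks two entities share, that the cost of a block is its number of pairwise comparisons, and that a greedy "largest block first" assignment is an acceptable stand-in for the paper's load balancing technique. All function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
import heapq


def blocking_graph(blocks):
    """Build the blocking graph explicitly.

    Nodes are entities; an edge joins two entities that co-occur in at
    least one block. Assumption: the weight is the number of shared blocks.
    """
    weights = Counter()
    for entities in blocks.values():
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights


def block_cost(entities):
    """Assumed cost model: number of pairwise comparisons in the block."""
    n = len(set(entities))
    return n * (n - 1) // 2


def balance_blocks(blocks, num_partitions):
    """Greedy balancing: give the costliest remaining block to the
    partition with the lowest accumulated cost so far."""
    heap = [(0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    partitions = [[] for _ in range(num_partitions)]
    by_cost = sorted(blocks, key=lambda b: block_cost(blocks[b]), reverse=True)
    for block_id in by_cost:
        load, i = heapq.heappop(heap)
        partitions[i].append(block_id)
        heapq.heappush(heap, (load + block_cost(blocks[block_id]), i))
    loads = [sum(block_cost(blocks[b]) for b in p) for p in partitions]
    return partitions, loads


# Toy block collection (hypothetical): one large block and several small ones.
blocks = {
    "b1": ["e1", "e2", "e3", "e4"],
    "b2": ["e1", "e2"],
    "b3": ["e3", "e4", "e5"],
    "b4": ["e2", "e5"],
    "b5": ["e1", "e5"],
}

graph = blocking_graph(blocks)
partitions, loads = balance_blocks(blocks, num_partitions=2)
```

On this toy input the large block ends up alone in one partition while the smaller blocks share the other, which is the kind of outcome a cost-based split aims for when block sizes are skewed. A greedy heuristic like this one does not guarantee exactly equal costs in general.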
Keywords: Meta-blocking, Map/Reduce model, Parallelization
Review history: Received 8 December 2015, Revised 19 November 2016, Accepted 1 December 2016, Available online 9 December 2016, Version of Record 4 January 2017.
Official paper URL: https://doi.org/10.1016/j.is.2016.12.001