Parallel meta-blocking for scaling entity resolution over big heterogeneous data

Authors:

Highlights:

• We adapt Meta-blocking to the MapReduce paradigm through 3 alternative parallelization strategies: an edge-based strategy that explicitly builds the blocking graph, a comparison-based strategy that uses the blocking graph implicitly, as a conceptual model, and an entity-based strategy that is independent of the blocking graph. We also provide concrete implementations for all weighting schemes that are used in Meta-blocking.

• We present a load balancing technique that deals with skewness in the input block collection, splitting it into partitions of the same computational cost.

• We verify the scalability of our techniques through a thorough experimental evaluation over the four largest, real datasets that have been applied to Meta-blocking. The data and the implementation of our techniques are publicly available.
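To make the load balancing highlight concrete, the following is a minimal sketch of one way to split a skewed block collection into partitions of roughly equal computational cost. It is not the paper's algorithm: the cost model (a block with n entities costs n(n-1)/2 pairwise comparisons), the greedy heaviest-first assignment, and the names `block_cost` and `balance_blocks` are all assumptions chosen for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def block_cost(size: int) -> int:
    """Assumed cost model: pairwise comparisons in a block of `size` entities."""
    return size * (size - 1) // 2


def balance_blocks(
    block_sizes: Dict[str, int], num_partitions: int
) -> List[Tuple[int, List[str]]]:
    """Greedy heuristic (hypothetical helper): assign the heaviest remaining
    block to the partition with the lowest accumulated cost so far."""
    # Heap entries are (accumulated cost, partition index, block ids).
    # The partition index is unique, so the lists are never compared.
    heap = [(0, i, []) for i in range(num_partitions)]
    heapq.heapify(heap)

    # Heaviest blocks first; ties are broken by block id for determinism.
    ordered = sorted(block_sizes, key=lambda b: (-block_cost(block_sizes[b]), b))
    for block_id in ordered:
        cost, idx, members = heapq.heappop(heap)
        members.append(block_id)
        heapq.heappush(heap, (cost + block_cost(block_sizes[block_id]), idx, members))

    # Return (total cost, block ids) per partition, in partition order.
    return [(cost, members) for cost, _, members in sorted(heap, key=lambda p: p[1])]


if __name__ == "__main__":
    sizes = {"b1": 5, "b2": 4, "b3": 4, "b4": 3, "b5": 3}
    for cost, members in balance_blocks(sizes, num_partitions=2):
        print(cost, members)
```

A heuristic of this kind cannot balance a collection dominated by a single very large block, since that block alone may exceed the average partition cost; handling that case would require splitting the comparisons of one block across partitions.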

Keywords: Meta-blocking, Map/Reduce model, Parallelization

Review history: Received 8 December 2015, Revised 19 November 2016, Accepted 1 December 2016, Available online 9 December 2016, Version of Record 4 January 2017.

Paper URL: https://doi.org/10.1016/j.is.2016.12.001