Skyblocking for Entity Resolution
In this paper, for the first time, we introduce the concept of skyblocking, which aims to efficiently identify the "most preferred" blocking schemes with respect to a given set of selection criteria for entity resolution blocking. To capture all possible preferred blocking schemes, we study the scheme skyline (i.e., the blocking schemes on the skyline) in a multi-dimensional scheme space whose dimensions correspond to selection criteria for blocking (e.g., pair completeness (PC) and pair quality (PQ)). However, applying traditional skyline techniques to learn scheme skylines is a non-trivial task. Due to the unique characteristics of blocking schemes, we face several challenges, such as how to find a balanced number of match and non-match labels to effectively approximate a blocking scheme in a scheme space, and how to design efficient skyline algorithms that explore a scheme space to find scheme skylines. To overcome these challenges, we propose a scheme skyline learning approach, which incorporates skyline techniques into an active learning process for scheme skylines. We have conducted experiments over four real-world datasets. The experimental results show that our approach is able to efficiently identify scheme skylines in a large scheme space using only a limited number of labels. Our approach also outperforms the state-of-the-art approaches for learning blocking schemes in several aspects, including label efficiency, blocking quality, and learning efficiency.
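To illustrate what it means for a blocking scheme to lie on the skyline, the following is a minimal sketch that assumes two selection criteria, PC and PQ, both to be maximized. It uses a naive pairwise dominance check, not the skyline algorithms or the active learning process proposed in the paper. The scheme names and scores are hypothetical values chosen for illustration, not results from the experiments.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: each scheme maps to (PC, PQ), both maximized.
Scores = Tuple[float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if a is at least as good as b on every criterion
    and strictly better on at least one."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def scheme_skyline(schemes: Dict[str, Scores]) -> List[str]:
    """Return the names of schemes not dominated by any other scheme.
    Naive O(n^2) comparison, for illustration only."""
    return sorted(
        name
        for name, scores in schemes.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in schemes.items()
            if other != name
        )
    )


# Made-up candidate schemes and scores (assumptions, not from the paper).
candidates = {
    "surname": (0.95, 0.10),
    "surname AND zip": (0.80, 0.60),
    "zip": (0.70, 0.40),
    "surname AND zip AND dob": (0.50, 0.90),
    "dob": (0.45, 0.30),
}

skyline = scheme_skyline(candidates)
print(skyline)
```

In this toy example, "zip" and "dob" are dominated by "surname AND zip", which scores higher on both criteria, so the skyline consists of the remaining three schemes. Each of them represents a different trade-off between PC and PQ.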