Optimizing Organizations for Navigating Data Lakes

12/17/2018
by Fatemeh Nargesian, et al.

Navigation is known to be an effective complement to search. In addition to supporting data discovery, navigation can help users develop a conceptual model of what types of data are available. In data lakes, there has been considerable research on dataset or table discovery using search. We consider the complementary problem of creating an effective navigation structure over a data lake. We define an organization as a navigation structure (a graph) whose nodes represent sets of attributes (from tables or from semi-structured documents) within a data lake and whose edges represent subset relationships. We propose a new problem, the data lake organization problem, in which the goal is to find an organization that lets a user find attributes or tables most efficiently. We present a new probabilistic model of how users interact with an organization and define the likelihood of a user finding an attribute or a table using the organization. Our approach uses the attribute values and, when available, metadata. For data lakes with little or no metadata, we propose a way of creating metadata from the metadata available in other lakes. We propose an approximate algorithm for the organization problem and show its effectiveness on a synthetic benchmark. Finally, we construct an organization on the tables of a real data lake containing data from federal Open Data portals and show that the organization dramatically improves the expected probability of discovering tables over a baseline. Using a second real data lake with no metadata, we show that metadata can be inferred that is effective for creating an organization.
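
To make the idea of an organization concrete, here is a minimal Python sketch of a navigation graph whose nodes are sets of attributes and whose edges are subset relationships, together with a toy user model. The abstract does not specify the paper's probabilistic model, so everything here is an assumption for illustration: the toy attributes, the Jaccard similarity, the softmax-style choice among children, and the `gamma` parameter are all hypothetical.

```python
import math

# Hypothetical toy lake: attribute name -> set of values (not from the paper).
ATTRS = {
    "city": {"toronto", "boston", "paris"},
    "country": {"canada", "usa", "france"},
    "species": {"salmon", "trout"},
}

# Organization: each node is a set of attributes.
NODES = {
    "root": {"city", "country", "species"},
    "places": {"city", "country"},
    "city": {"city"},
    "country": {"country"},
    "species": {"species"},
}
# Edges go from a parent to children whose attribute sets are subsets of it.
EDGES = {"root": ["places", "species"], "places": ["city", "country"]}


def edges_are_subsets():
    """Check that every edge links a node to a subset of its attributes."""
    return all(NODES[c] <= NODES[p] for p, kids in EDGES.items() for c in kids)


def node_values(node):
    """Union of the values of all attributes grouped under a node."""
    return set().union(*(ATTRS[a] for a in NODES[node]))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transition_probs(parent, target, gamma=5.0):
    """Assumed user model: pick a child with probability proportional to
    exp(gamma * similarity between the sought attribute and the child)."""
    kids = EDGES.get(parent, [])
    weights = [math.exp(gamma * jaccard(ATTRS[target], node_values(k)))
               for k in kids]
    total = sum(weights)
    return {k: w / total for k, w in zip(kids, weights)}


def discovery_prob(node, target, gamma=5.0):
    """Probability that a user starting at `node` reaches the leaf holding
    exactly the target attribute, summed over all downward paths."""
    if not EDGES.get(node):  # leaf node
        return 1.0 if NODES[node] == {target} else 0.0
    return sum(p * discovery_prob(k, target, gamma)
               for k, p in transition_probs(node, target, gamma).items())


if __name__ == "__main__":
    print(edges_are_subsets())
    print(discovery_prob("root", "city"))
```

Under this assumed model, an organization problem could be framed as choosing `NODES` and `EDGES` so that `discovery_prob` is high on average across the attributes in the lake; the paper's actual objective and algorithm may differ.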
