Modeling Data Analytic Iteration With Probabilistic Outcome Sets
In 1977 John Tukey described how in exploratory data analysis, data analysts use tools, such as data visualizations, to separate their expectations from what they observe. In contrast to statistical theory, an underappreciated aspect of data analysis is that a data analyst must make decisions by comparing the observed data or output from a statistical tool to what the analyst previously expected from the data. However, there is little formal guidance for how to make these data analytic decisions as statistical theory generally omits a discussion of who is using these statistical methods. Here, we extend the basic idea of comparing an analyst's expectations to what is observed in a data visualization to more general analytic situations. In this paper, we propose a model for the iterative process of data analysis based on the analyst's expectations, using what we refer to as expected and anomaly probabilistic outcome sets, and the concept of statistical information gain. Our model posits that the analyst's goal is to increase the amount of information the analyst has relative to what the analyst already knows, through successive analytic iterations. We introduce two criteria–expected information gain and anomaly information gain–to provide guidance about analytic decision-making and ultimately to improve the practice of data analysis. Finally, we show how our framework can be used to characterize common situations in practical data analysis.
READ FULL TEXT