On synthetic data with predetermined subject partitioning and cluster profiling, and pre-specified categorical variable marginal dependence structure

09/04/2017
by Michail Papathomas, et al.

A standard approach for assessing the performance of partition or mixture models is to create synthetic data sets with a pre-specified clustering structure, and to assess how well the model recovers this structure. Typically, subjects are assigned to different clusters, and variable observations are simulated so that subjects within the same cluster have similar profiles, allowing for some variability. In this manuscript, we consider observations from nominal, ordinal and interval categorical variables. Theoretical and empirical results are used to explore the dependence structure between the variables, in relation to the clustering structure for the subjects. A novel approach is proposed that makes it possible to control the marginal association or correlation structure of the variables, and to specify exact correlation values. Practical examples are shown and additional theoretical results are derived for interval data, commonly observed in cohort studies, including observations that emulate Single Nucleotide Polymorphisms. We compare a synthetic data set to a real one, to demonstrate similarities and differences.
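The general setup described above (subjects assigned to clusters, with categorical observations drawn from cluster-specific profiles) can be illustrated with a minimal Python sketch. This is not the authors' proposed method and does not control exact correlation values; it only shows how a predetermined partition induces marginal dependence between variables that are independent within each cluster. The function name `simulate_clustered_categorical`, the cluster weights, and the profile probabilities are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate_clustered_categorical(n_subjects, cluster_weights, profiles, seed=0):
    """Simulate categorical data with a predetermined cluster structure.

    profiles[k][j] is the (assumed) probability vector over the levels of
    variable j for subjects in cluster k. Variables are drawn independently
    within each cluster.
    """
    rng = np.random.default_rng(seed)
    profiles = np.asarray(profiles, dtype=float)  # shape (K, P, L)
    n_clusters, n_vars, n_levels = profiles.shape

    # Assign each subject to a cluster.
    labels = rng.choice(n_clusters, size=n_subjects, p=cluster_weights)

    data = np.empty((n_subjects, n_vars), dtype=int)
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        for j in range(n_vars):
            # Draw levels 0..L-1 from the cluster-specific profile.
            data[members, j] = rng.choice(
                n_levels, size=members.size, p=profiles[k, j]
            )
    return labels, data

# Illustrative values only: two equally likely clusters and three interval
# variables with levels 0, 1, 2 (loosely emulating SNP-style counts).
weights = [0.5, 0.5]
profiles = [
    [[0.7, 0.2, 0.1]] * 3,  # cluster 0 favours level 0
    [[0.1, 0.2, 0.7]] * 3,  # cluster 1 favours level 2
]
labels, data = simulate_clustered_categorical(5000, weights, profiles, seed=1)

# Correlation over all subjects versus within a single cluster.
marginal_corr = np.corrcoef(data, rowvar=False)
within_corr = np.corrcoef(data[labels == 0], rowvar=False)
```

Under these assumed profiles, each pair of variables has a marginal correlation of 0.45 in expectation (between-cluster variance 0.36 over total variance 0.80), while the within-cluster correlation is zero by construction. Comparing `marginal_corr` with `within_corr` shows how the clustering alone drives the marginal dependence.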
