Two-Phase Data Synthesis for Income: An Application to the NHIS
We propose a two-phase synthesis process for synthesizing income, a sensitive variable which is usually highly-skewed and has a number of reported zeros. We consider two forms of a continuous income variable: a binary form, which is modeled and synthesized in phase 1; and a non-negative continuous form, which is modeled and synthesized in phase 2. Bayesian synthesis models are proposed for the two-phase synthesis process, and other synthesis models such as classification and regression trees (CART) are readily implementable. We demonstrate our methods as applications to a sample from the National Health Interview Survey (NHIS). Utility and risk profiles of generated synthetic datasets are evaluated and compared to results from a single-phase synthesis process.
READ FULL TEXT