Strategies to facilitate access to detailed geocoding information using synthetic data
In this paper we investigate whether generating synthetic data can be a viable strategy for providing external researchers with access to detailed geocoding information without compromising the confidentiality of the units included in the database. This research was motivated by a recent project at the Institute for Employment Research (IAB) in Germany that linked exact geocodes to the Integrated Employment Biographies, a large administrative database containing several million records. Based on these data we evaluate how well several synthesizers address the trade-off between preserving analytical validity and limiting the risk of disclosure. We propose strategies for making the synthesizers scalable for such large files, present analytical validity measures for the generated data, and provide general recommendations for statistical agencies considering the synthetic data approach for disseminating detailed geographical information. We also illustrate that the commonly used disclosure avoidance strategy of providing geographical information only on an aggregated level will not offer substantial improvements in disclosure protection if coupled with synthesis. As we show in the online supplement accompanying this manuscript, synthesizing additional variables should be preferred if the level of protection from synthesizing only the geocodes is not considered sufficient.