Bayesian Uncertainty Estimation Under Complex Sampling
Multistage sampling designs used by federal statistical agencies are typically constructed to maximize the efficiency of the target domain-level estimator (e.g., indexed by geographic area) within the cost constraints of administering survey instruments. Sampling designs are usually constructed to be informative, whereby inclusion probabilities are correlated with the response variable of interest to minimize the variance of the resulting estimator. Multistage sampling designs may also induce dependence between the sampled units; for example, a sampling step may select geographically indexed clusters of units in order to manage the cost of collection efficiently. A data analyst may use a sampling-weighted pseudo-posterior distribution to estimate the population model on the observed sample. The dependence induced between co-clustered units inflates the scale of the resulting pseudo-posterior covariance matrix, which has been shown to induce undercoverage of the credibility sets. While the pseudo-posterior distribution contracts on the true population model parameters, we demonstrate that the scale and shape of the asymptotic distributions differ among the maximum likelihood estimator (MLE), the pseudo-posterior, and the MLE under simple random sampling. Motivated by the different forms of the asymptotic covariance matrices and the within-cluster dependence, we devise a correction applied as a simple and fast post-processing step to our MCMC draws from the pseudo-posterior distribution. Our updating step projects the pseudo-posterior covariance matrix such that the nominal coverage is approximately achieved with credibility sets that account for both the distribution for population generation, P_θ_0, and the distribution for the multistage, informative sampling, P_ν. We demonstrate the efficacy of our procedure on synthetic data and make an application to the National Survey on Drug Use and Health.
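The post-processing idea can be illustrated with a minimal sketch. It is not the authors' procedure, only one generic way to rescale a set of MCMC draws so that their sample covariance matches a chosen target while their mean is left unchanged. The function name `rescale_draws` and the argument `target_cov` are hypothetical; in practice the target would be an estimate of the design-adjusted asymptotic covariance, and this abstract does not say how that estimate is built.

```python
import numpy as np


def rescale_draws(draws, target_cov):
    """Affinely map draws so their sample covariance equals target_cov.

    draws      : (n_draws, n_params) array of MCMC draws (assumed input).
    target_cov : (n_params, n_params) positive-definite matrix (assumed to be
                 supplied by the analyst; hypothetical stand-in for a
                 design-adjusted covariance estimate).
    """
    draws = np.asarray(draws, dtype=float)
    mean = draws.mean(axis=0)
    centered = draws - mean

    # Sample covariance of the draws (columns are parameters).
    sample_cov = np.cov(draws, rowvar=False)

    # Upper-triangular factors R with R.T @ R equal to each covariance.
    r_sample = np.linalg.cholesky(sample_cov).T
    r_target = np.linalg.cholesky(target_cov).T

    # inv(r_sample) whitens the centered draws; r_target then imposes
    # the target covariance. solve() avoids forming the inverse explicitly.
    transform = np.linalg.solve(r_sample, r_target)
    return centered @ transform + mean


# Illustrative usage with simulated draws (not real survey output).
rng = np.random.default_rng(0)
draws = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.3], [0.3, 0.5]], size=2000)
target_cov = np.array([[2.0, 0.4], [0.4, 1.5]])
adjusted = rescale_draws(draws, target_cov)
```

Because the map is affine, credible intervals computed from `adjusted` keep the same center as those from `draws` and change only in width and orientation.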