Surrogate-Assisted Federated Learning of high dimensional Electronic Health Record Data
Surrogate variables in electronic health records (EHR) play an important role in biomedical studies because chart-reviewed gold standard labels are scarce or absent, and supervised methods that use only labeled data perform poorly in this setting. Meanwhile, synthesizing multi-site EHR data is crucial for powerful and generalizable statistical learning, but it runs into the privacy constraint that individual-level data may not be transferred out of the local sites, a constraint known as DataSHIELD. In this paper, we develop a novel approach named SASH, for Surrogate-Assisted and data-Shielding High-dimensional integrative regression. SASH leverages sizable unlabeled data from multiple local sites, together with EHR surrogates predictive of the response, to assist training with labeled data and substantially improve statistical efficiency. It first extracts a preliminary supervised estimator to enable convex training of a regularized single index model for the surrogate at each local site, and then aggregates the fitted local models to learn the target outcome model accurately. It protects individual-level information at the local sites by aggregating only summary statistics. We show that, under mild conditions, our method attains substantially lower estimation error rates than the supervised or local semi-supervised (SS) methods, and is asymptotically equivalent to the ideal individual patient data (IPD) pooled estimator, which is available only in the absence of privacy constraints. Through simulation studies, we demonstrate that SASH outperforms all existing supervised or SS federated approaches and performs close to IPD. Finally, we apply our method to develop a high-dimensional genetic risk model for type II diabetes using large-scale biobank data sets from UK Biobank and Mass General Brigham, where only a small fraction of subjects from the latter have been labeled via chart review.
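To make the idea of summary-statistics-based aggregation concrete, the following is a minimal sketch and not the SASH procedure itself. It assumes a plain ridge-penalized linear model in place of the regularized single index model and uses no surrogates; the function names, the simulated data, and the penalty `lam` are all illustrative assumptions. Each site shares only the aggregates X^T X and X^T y, and the estimator built from them matches the one fitted on pooled individual-level data.

```python
import random

def gram(X, y):
    """Summary statistics for one site: (X^T X, X^T y). No rows leave the site."""
    p = len(X[0])
    xtx = [[sum(r[j] * r[k] for r in X) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(p)]
    return xtx, xty

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def ridge_from_summaries(summaries, lam):
    """Ridge estimate from per-site summaries: solve (sum X^T X + lam I) b = sum X^T y."""
    p = len(summaries[0][1])
    xtx = [[sum(s[0][j][k] for s in summaries) for k in range(p)] for j in range(p)]
    xty = [sum(s[1][j] for s in summaries) for j in range(p)]
    for j in range(p):
        xtx[j][j] += lam
    return solve(xtx, xty)

def simulate_site(n, beta, rng):
    """Hypothetical site data: Gaussian covariates, linear outcome plus noise."""
    X = [[rng.gauss(0, 1) for _ in beta] for _ in range(n)]
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0, 0.5) for row in X]
    return X, y

rng = random.Random(0)
true_beta = [1.0, -2.0, 0.0, 0.5]          # assumed coefficients for the demo
sites = [simulate_site(n, true_beta, rng) for n in (40, 60, 80)]
lam = 1.0                                   # illustrative penalty level

# Federated fit: only the per-site summaries are combined.
beta_fed = ridge_from_summaries([gram(X, y) for X, y in sites], lam)

# Reference fit on pooled individual-level data (the "IPD" benchmark).
X_all = [row for X, _ in sites for row in X]
y_all = [yi for _, y in sites for yi in y]
beta_pooled = ridge_from_summaries([gram(X_all, y_all)], lam)

max_gap = max(abs(a - b) for a, b in zip(beta_fed, beta_pooled))
```

In this linear toy case the agreement is exact up to floating-point error, because the pooled Gram matrix is the sum of the site-level ones. The equivalence the abstract claims for SASH is asymptotic and concerns a high-dimensional, surrogate-assisted model, so this sketch only conveys why sharing aggregates need not cost accuracy.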