Linear regression and its inference on noisy network-linked data
Linear regression on a set of observations linked by a network has been an essential tool for modeling the relationship between a response and covariates when additional network data are available. Despite its wide range of applications in areas such as the social sciences and health-related research, the problem has not been well studied in statistics so far. Previous methods either lack inference tools or rely on restrictive assumptions on social effects, and they usually assume that networks are observed without errors, which is unrealistic in many problems. In this paper, we propose a linear regression model with nonparametric network effects. Our model does not assume that the relational data or network structure is exactly observed; thus, the method can be provably robust to a certain level of perturbation of the network structure. We establish a set of asymptotic inference results under a general requirement on the network perturbation and then study the robustness of our method in the specific setting where the perturbation comes from random network models. We discover a phase-transition phenomenon of inference validity with respect to the network density when no prior knowledge about the network model is available, and we also show the significant improvement achieved by knowing the network model. A by-product of our analysis is a rate-optimal concentration bound for subspace projection that may be of independent interest. We conduct extensive simulation studies to verify our theoretical observations and demonstrate the advantage of our method over a few benchmarks in terms of accuracy and computational efficiency under different data-generating models. The method is then applied to adolescent network data to study gender and racial differences in social activities.
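To make the setting concrete, the following is a minimal illustrative sketch, not the estimator or inference procedure proposed in the paper. It assumes, purely for illustration, that individual network effects lie close to the span of the top-K eigenvectors of the observed (noisy) adjacency matrix, and fits ordinary least squares on the covariates together with that eigenvector basis. The function name `fit_with_network_subspace`, the choice of K, and the two-block simulation are all hypothetical.

```python
import numpy as np


def fit_with_network_subspace(X, y, A, K):
    """OLS of y on X plus the top-K eigenvectors of A (hypothetical helper).

    Assumption for illustration only: network effects are well approximated
    by the span of the K leading eigenvectors of the observed adjacency A.
    """
    # eigh handles symmetric matrices and returns real eigenpairs.
    eigvals, eigvecs = np.linalg.eigh(A)
    # Keep the K eigenvectors with the largest absolute eigenvalues.
    top = np.argsort(np.abs(eigvals))[::-1][:K]
    W = eigvecs[:, top]
    # Joint least squares on covariates and the network basis.
    Z = np.hstack([X, W])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    beta_hat = coef[: X.shape[1]]        # covariate effects
    alpha_hat = W @ coef[X.shape[1]:]    # fitted individual network effects
    return beta_hat, alpha_hat


# Simulated example: a two-block random network observed with sampling noise.
rng = np.random.default_rng(0)
n = 300
blocks = np.repeat([0, 1], n // 2)
P = np.where(blocks[:, None] == blocks[None, :], 0.5, 0.1)  # edge probabilities
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper + upper.T).astype(float)                         # symmetric, no self-loops

X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, -2.0])
alpha_true = np.where(blocks == 0, 1.0, -1.0)               # block-level network effect
y = X @ beta_true + alpha_true + 0.1 * rng.normal(size=n)

beta_hat, alpha_hat = fit_with_network_subspace(X, y, A, K=2)
print("estimated beta:", beta_hat)
```

Because the basis is computed from the observed adjacency matrix rather than its expectation, the sketch only recovers the network effects up to the perturbation of that eigenspace, which is the kind of error the abstract's robustness results are concerned with.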