Modifying the Asynchronous Jacobi Method for Data Corruption Resilience
Moving scientific computation from high-performance computing (HPC) and cloud computing (CC) environments to devices on the edge, where data can be collected by streamlined computing devices that are physically near instruments of interest, has garnered tremendous interest in recent years. Such edge computing environments can operate on data in situ instead of requiring the collection of data in HPC and/or CC facilities, offering enticing benefits that include avoiding the costs of transmission over potentially unreliable or slow networks, increased data privacy, and real-time data analysis. Before such benefits can be realized at scale, new fault-tolerance approaches must be developed to address the inherent unreliability of edge computing environments, because the traditional approaches used by HPC and CC are not generally applicable to edge computing. Those traditional approaches commonly rely on checkpoint-and-restart and/or redundant-computation strategies, which are not feasible in edge computing environments where data storage is limited and synchronization is costly. Motivated by prior algorithm-based fault tolerance approaches, an asynchronous Jacobi (ASJ) variant with resilience to data corruption is developed herein by leveraging existing convergence theory. The ASJ variant rejects a solution approximation received from a neighbor device if the distance between two successive approximations violates an analytic bound. Numerical results show that the ASJ variant restores convergence in the presence of certain types of natural and malicious data corruption.
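The rejection idea can be illustrated with a minimal sketch; it is not the ASJ variant itself. Several things here are assumptions made for illustration only: the sweeps are simulated synchronously in a single process, the matrix is strictly diagonally dominant, the initial guess is zero, and the threshold `tau` is a simple stand-in derived from the standard Jacobi contraction argument rather than the analytic bound referred to in the abstract. The function name `jacobi_with_rejection` and its parameters are hypothetical.

```python
import numpy as np


def jacobi_with_rejection(A, b, sweeps=200, corrupt_every=5,
                          corrupt_value=1e6, reject=True):
    """Jacobi sweeps in which one incoming component is periodically corrupted.

    Assumes A is strictly diagonally dominant and the initial guess is zero.
    """
    d = np.diag(A)
    G = -(A - np.diag(d)) / d[:, None]   # Jacobi iteration matrix -D^{-1}(A - D)
    c = b / d
    rho = np.linalg.norm(G, np.inf)      # < 1 under strict diagonal dominance

    # Stand-in bound (an assumption, not the paper's bound): starting from zero,
    # every fault-free iterate x satisfies ||x - x*||_inf <= ||c||_inf / (1 - rho),
    # so two fault-free values of one component differ by at most twice that.
    tau = 2.0 * np.linalg.norm(c, np.inf) / (1.0 - rho)

    x = np.zeros_like(b, dtype=float)
    rejected = 0
    for k in range(sweeps):
        candidate = G @ x + c
        if corrupt_every and k % corrupt_every == 0:
            # Simulated corruption of one value in transit.
            candidate[k % len(b)] = corrupt_value
        if reject:
            ok = np.abs(candidate - x) <= tau
            rejected += int(np.count_nonzero(~ok))
            x = np.where(ok, candidate, x)   # keep the old value where rejected
        else:
            x = candidate
    return x, rejected


# Small strictly diagonally dominant test problem: tridiag(-1, 4, -1).
n = 10
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x_exact = np.linalg.solve(A, b)

x_guarded, n_rejected = jacobi_with_rejection(A, b, reject=True)
x_unguarded, _ = jacobi_with_rejection(A, b, reject=False)

err_guarded = np.max(np.abs(x_guarded - x_exact))
err_unguarded = np.max(np.abs(x_unguarded - x_exact))
print(f"rejected updates:        {n_rejected}")
print(f"error with rejection:    {err_guarded:.3e}")
print(f"error without rejection: {err_unguarded:.3e}")
```

In this toy setting the guarded run discards each corrupted value and keeps the previous one for that component, so the iteration behaves like a Jacobi sweep with an occasional stale entry. A corruption small enough to pass the threshold would be accepted, which is consistent with the abstract's wording that only certain types of corruption are handled.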