Online Fault Classification in HPC Systems through Machine Learning

10/26/2018
by Alessio Netti, et al.

As High-Performance Computing (HPC) systems strive towards exascale goals, studies suggest that they will experience excessive failure rates, mainly due to the massive parallelism that they require. Long-running exascale computations would be severely affected by a variety of failures, which could occur as often as every few minutes. Operating such systems will therefore require detecting and classifying faults as they occur, and initiating corrective actions through appropriate resiliency techniques before the faults can develop into failures. In this paper, we propose a fault classification method for HPC systems that is based on machine learning and designed for live streamed data. Our solution is cast within realistic operating constraints, especially those deriving from the need to run the classifier online. Our results show that almost perfect classification accuracy can be reached for different fault types with low computational overhead and minimal delay. Our study is based on a dataset, now publicly available, that was acquired by injecting faults into an in-house experimental HPC system.
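
To make the idea of online classification on streamed data concrete, the following is a minimal illustrative sketch, not the method proposed in the paper. It assumes a single numeric metric stream, a fixed-size sliding window summarized by its mean and standard deviation, and a toy nearest-centroid classifier. The names `window_features`, `NearestCentroid`, and `classify_stream`, the choice of features, and the window size are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def window_features(window):
    """Summarize one window of samples as (mean, population std. dev.)."""
    values = list(window)
    return (mean(values), pstdev(values))


class NearestCentroid:
    """Toy classifier (an assumption): pick the label with the closest centroid."""

    def fit(self, features, labels):
        grouped = {}
        for feat, label in zip(features, labels):
            grouped.setdefault(label, []).append(feat)
        # Centroid = per-dimension average of each class's feature vectors.
        self.centroids = {
            label: tuple(mean(dim) for dim in zip(*feats))
            for label, feats in grouped.items()
        }
        return self

    def predict(self, feat):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(feat, centroid))

        return min(self.centroids, key=lambda lbl: sq_dist(self.centroids[lbl]))


def classify_stream(samples, model, window_size=4):
    """Yield one predicted label per incoming sample once the window is full."""
    window = deque(maxlen=window_size)  # oldest sample is dropped automatically
    for value in samples:
        window.append(value)
        if len(window) == window_size:
            yield model.predict(window_features(window))
```

In this sketch the classifier only ever sees the most recent `window_size` samples, so each prediction needs constant memory and a small fixed amount of work per sample, which is the kind of constraint an online setting imposes.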
