Efficient Classification of Supercomputer Failures Using Neuromorphic Computing

Date, Prasanna
Carothers, Christopher D.
Hendler, James A.
Magdon-Ismail, Malik
Terms of Use
Attribution-NonCommercial-NoDerivs 3.0 United States
Full Citation
P. Date, C. D. Carothers, J. A. Hendler and M. Magdon-Ismail, "Efficient Classification of Supercomputer Failures Using Neuromorphic Computing," 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 2018, pp. 242-249, doi: 10.1109/SSCI.2018.8628946.
Today's petascale supercomputers comprise tens of thousands of compute nodes. Failures on these massive machines are a growing problem because the time for a single compute node to fail is shrinking. Ideally, the job scheduler would be able to predict node failures ahead of time in order to minimize their impact on overall job throughput. However, due to the tight power constraints of future systems, the online modeling of real-time error data must be accomplished using as little power as possible. To this end, the IBM TrueNorth Neurosynaptic System is used to create a Spiking Neural Network (SNN) model of supercomputer failure data, and the classification accuracy of this model is compared to other Machine Learning (ML) and Deep Learning (DL) techniques. The TrueNorth failure classification model yields a training accuracy of 99.41%, a validation accuracy of 98.12%, and a testing accuracy of 99.80%, and outperforms the other ML and DL approaches. Moreover, the TrueNorth SNN consumes five orders of magnitude less power than the other ML/DL approaches during the testing phase. Additionally, all ML/DL approaches investigated in this study are able to produce accurate models of the supercomputer system failure data.
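To make the "five orders of magnitude" comparison concrete, the following minimal sketch shows how such a gap can be computed as the base-10 logarithm of a power ratio. The function name and the wattage values are hypothetical placeholders chosen for illustration; they are not measurements reported in the paper.

```python
import math


def orders_of_magnitude(baseline_watts: float, candidate_watts: float) -> float:
    """Return log10(baseline / candidate), i.e. how many powers of ten
    separate the two power draws."""
    if baseline_watts <= 0 or candidate_watts <= 0:
        raise ValueError("power values must be positive")
    return math.log10(baseline_watts / candidate_watts)


# Hypothetical placeholder values, NOT figures from the paper.
hypothetical_baseline_watts = 100.0    # assumed draw of a conventional ML/DL platform
hypothetical_candidate_watts = 0.001   # assumed draw of a neuromorphic chip

gap = orders_of_magnitude(hypothetical_baseline_watts, hypothetical_candidate_watts)
print(f"{gap:.1f} orders of magnitude")
```

A result of 5 from this function corresponds to a power ratio of 100,000 to 1, which is the scale of difference the abstract describes for the testing phase.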