Communications of the ACM

ACM TechNews

Diagnosing Performance Problems in Supercomputers

By Government Computer News
December 13, 2017

Overseeing maintenance on a supercomputer. Researchers at Sandia National Laboratories and Boston University spent more than a year developing a framework to automatically monitor and diagnose performance issues in supercomputers.

Credit: Sandia National Laboratories

Researchers at Sandia National Laboratories and Boston University (BU) have spent more than a year developing the Lightweight Distributed Metric Service (LDMS), a framework to automatically monitor and diagnose performance issues in supercomputers.

Using LDMS to diagnose supercomputer problems should help systems administrators allocate resources and schedule jobs to maximize performance.

The team says it used supervised machine learning, writing programs to reproduce known anomalies likely to affect a Cray XC30m supercomputer at Sandia and BU's Mass Open Cloud system. With LDMS, the supercomputer collected more than 700 metrics each second for each computer, while the cloud collected about 50 metrics at two- or three-second granularity.

Sandia's Vitus Leung notes the difference stems from the "noisiness of the data on the BU cloud, because it's not nearly as dedicated."

The researchers collated statistical characteristics of the data, reducing it to about 10% of its raw volume before feeding it to machine-learning algorithms.
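
As a rough illustration of that reduction step, and not the researchers' actual pipeline, the sketch below summarizes each window of raw samples with a few statistics and trains a simple classifier on labeled runs. The window length, the choice of statistics, the synthetic metric values, and the nearest-centroid classifier are all assumptions made for this example; none of them comes from LDMS or the study.

```python
import random
import statistics


def window_features(samples):
    """Summarize one window of raw metric samples as four statistics."""
    return (
        statistics.mean(samples),
        statistics.pstdev(samples),
        min(samples),
        max(samples),
    )


def synthetic_window(rng, anomalous, length=40):
    """Hypothetical data: an anomalous run shows a higher, noisier metric."""
    mean, spread = (80.0, 10.0) if anomalous else (40.0, 3.0)
    return [rng.gauss(mean, spread) for _ in range(length)]


def train_centroids(labeled_features):
    """Supervised step: average the feature vectors seen for each label."""
    by_label = {}
    for features, label in labeled_features:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(statistics.mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def classify(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=squared_distance)


rng = random.Random(0)

# Labeled training runs, standing in for programs that reproduce known anomalies.
training = []
for _ in range(20):
    for anomalous in (True, False):
        label = "anomalous" if anomalous else "healthy"
        training.append((window_features(synthetic_window(rng, anomalous)), label))

centroids = train_centroids(training)

# Four statistics per 40 raw samples keeps one tenth of the values.
raw = synthetic_window(rng, anomalous=True)
features = window_features(raw)
print("fraction of values kept:", len(features) / len(raw))
print("diagnosis:", classify(centroids, features))
```

Summarizing each window before training means the classifier sees far fewer numbers per run, which is the general idea behind reducing the data to a fraction of its raw size.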

From Government Computer News