Lawrence Livermore National Laboratory (LLNL) researchers have developed the Stack Trace Analysis Tool (STAT), a highly scalable, lightweight tool that has been used to debug a program running more than one million MPI processes on the IBM Blue Gene/Q-based Sequoia supercomputer. The debugging tool is part of a multi-year collaboration among LLNL, the University of Wisconsin-Madison, and the University of New Mexico.
The researchers say STAT has helped early-access users and system integrators quickly isolate a wide range of errors, including complicated issues that appeared only at extremely large scales. "STAT has been indispensable in this capacity, helping the multi-disciplined integration team keep pace with the aggressive system scale-up schedule," says LLNL's Greg Lee.
During testing, STAT singled out, from among more than one million MPI processes, one particular rank that was consistently stuck in a system call, according to LLNL's Dong Ahn.
"It is critical that our development teams have a comprehensive parallel debugging tool set as they iron out the inevitable issues that come up with running on a new system like Sequoia," says LLNL's Kim Cupps.
From Lawrence Livermore National Laboratory
Abstracts Copyright © 2012 Information Inc., Bethesda, Maryland, USA