

Hackathon Optimizes Code for Many-Core Processors



Xeon Phi hackathon participants at Brookhaven Lab included mentor Bei Wang (left) of Princeton University, mentor Hideki Saito (center) of Intel, and participant Han Aung, a graduate student at Yale University.

Credit: Brookhaven National Laboratory

Supercomputers are enabling scientists to study problems they could not otherwise tackle—from understanding what happens when two black holes collide and figuring out how to make tiny carbon nanotubes that clean up oil spills to determining the binding sites of proteins associated with cancer. Such problems involve datasets that are too large or complex for human analysis.

In 2016, Intel released the second generation of its many-integrated-core architecture targeting high-performance computing: the Intel Xeon Phi processor (formerly code-named "Knights Landing"). With up to 72 processing units, or cores, per chip, Xeon Phi is designed to carry out many calculations at the same time (in parallel). This architecture is well suited to the large, complex computations characteristic of scientific applications.

Other features that make Xeon Phi appealing for such applications include its fast memory access; its ability to execute multiple streams of instructions, or threads, simultaneously while sharing some computing resources (multithreading); and its support for efficient vectorization, a form of parallel programming in which the processor performs the same operation on multiple independent data elements (vectors) in a single processing cycle. All of these features can greatly enhance performance, enabling scientists to solve problems more quickly and efficiently.
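
For a concrete (if purely illustrative) picture of these two features, the short C sketch below uses OpenMP, the shared-memory programming standard that figured in the hackathon; it is an assumed example, not code from the event. The "parallel for" directive spreads loop iterations across cores (multithreading), while "simd" asks the compiler to vectorize each thread's share of the work:

    /* Illustrative only -- not code from the hackathon. Combines
       multithreading and vectorization with OpenMP; compile with,
       e.g., icc -qopenmp -O3 or gcc -fopenmp -O3. */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static float a[N], b[N], c[N];

    int main(void) {
        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        /* "parallel for" spreads iterations across cores (multithreading);
           "simd" vectorizes each thread's share so one instruction handles
           several array elements per cycle. */
        #pragma omp parallel for simd
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[0] = %.1f, max threads = %d\n", c[0], omp_get_max_threads());
        return 0;
    }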

Xeon Phi Foe Fum

Currently, several supercomputers in the United States are based on Intel's Xeon Phi processors, including Cori at the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility at Lawrence Berkeley National Laboratory; Theta at Argonne Leadership Computing Facility, another DOE Office of Science User Facility; and Stampede2 at the University of Texas at Austin's Texas Advanced Computing Center. Smaller-scale systems, such as the computing cluster at DOE's Brookhaven National Laboratory, also rely on this architecture. But in order to take full advantage of its capabilities, users need to adapt and optimize their applications accordingly.

To facilitate that process, Brookhaven Lab's Computational Science Initiative (CSI) hosted a five-day coding marathon, or hackathon, in partnership with the High-Energy Physics Center for Computational Excellence—which Brookhaven joined last July—and collaborators from the SOLLVE software development project funded by DOE's Exascale Computing Project.

"The goal of this hands-on workshop was to help participants optimize their application codes to exploit the different levels of parallelism and memory hierarchies in the Xeon Phi architecture," says CSI computational scientist Meifeng Lin, who co-organized the hackathon with CSI Director Kerstin Kleese van Dam, CSI Computer Science and Mathematics Department Head Barbara Chapman, and CSI computational scientist Martin Kong. "By the end of the hackathon, the participants had not only made their codes run more efficiently on Xeon Phi-based systems, but also learned about strategies that could be applied to other CPU [central processing unit]-based systems to improve code performance."

Last year, Lin was part of the committee that organized Brookhaven's first hackathon, at which teams learned how to program their scientific applications on computing devices called graphics processing units. As was the case for that hackathon, this one was open to any current or potential user of the hardware. In the end, five teams of three to four members each—representing Brookhaven Lab, the Institute for Mathematical Sciences in India, McGill University, Stony Brook University, University of Miami, University of Washington, and Yale University—were accepted to participate in the Intel Xeon Phi hackathon.

Multi-Processors and Mentors

From February 26 through March 2, nearly 20 users of Xeon Phi-based supercomputers came together at Brookhaven Lab to be mentored by computing experts from Brookhaven and Lawrence Berkeley national labs, Indiana University, Princeton University, University of Bielefeld in Germany, and University of California-Berkeley. The hackathon organizing committee selected the mentors based on their experience in Xeon Phi optimization and in shared-memory parallel programming with the OpenMP (Open Multi-Processing) industry standard.
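
To give a flavor of that shared-memory style, here is a minimal, purely illustrative OpenMP example (again, not code from the event): all threads read the same shared array, and a reduction clause lets each thread accumulate a private partial sum that OpenMP combines when the loop ends:

    /* Illustrative only: OpenMP's shared-memory model. All threads read
       the shared array x; reduction(+:sum) gives each thread a private
       partial sum and combines the results at the end of the loop. */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { N = 1 << 20 };
        static double x[N];
        for (int i = 0; i < N; i++) x[i] = 0.5;

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += x[i];

        printf("sum = %.1f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }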

Participants did not need prior Xeon Phi experience to attend. Several weeks before the hackathon, the teams were matched with mentors whose scientific backgrounds were relevant to their respective application codes. The mentors and teams then held a series of meetings to discuss the limitations of the existing codes and the teams' goals for the hackathon. In addition to their dedicated mentors, the teams had access to four Intel technical experts with backgrounds in programming and scientific domains; these experts floated among the teams during the event, providing expertise in hardware architecture and performance optimization.

"The hackathon provided an excellent opportunity for application developers to talk and work with Intel experts directly," says mentor Bei Wang, an HPC software engineer at Princeton University. "The result was a significant speed up in the time it takes to optimize code, thus helping application teams achieve their science goals at a faster pace. Events like this hackathon are of great value to both scientists and vendors."

The five codes that were optimized cover a wide variety of applications:

  • A code for tracking particle-device and particle-particle interactions that has the potential to be used as the design platform for future particle accelerators.
  • A code for simulating the evolution of the quark-gluon plasma (a hot, dense state of matter thought to have been present for a few millionths of a second after the Big Bang) produced through high-energy collisions at Brookhaven's Relativistic Heavy Ion Collider, a DOE Office of Science User Facility.
  • An algorithm for sorting database records, such as DNA sequences, to identify inherited genetic variations and disorders.
  • A code for simulating the formation of structures in the universe, particularly galaxy clusters.
  • A code for simulating the interactions between quarks and gluons in real time.

"Large-scale numerical simulations are required to describe the matter created at the earliest times after the collision of two heavy ions," says team member Mark Mace, a Ph.D. candidate in the Nuclear Theory Group in the Physics and Astronomy Department at Stony Brook University and the Nuclear Theory Group in the Physics Department at Brookhaven Lab. "My team had a really successful week—we were able to make our code run much faster (20x), and this improvement is a game changer as far as the physics we can study with the resources we have. We will now be able to more accurately describe the matter created after heavy-ion collisions, study a larger array of macroscopic phenomena observed in such collisions, and make quantitative predictions for experiments at RHIC and the Large Hadron Collider in Europe."

"With the new memory subsystem recently released by Intel, we can order a huge number of elements faster than with conventional memory because more data can be transferred at a time," says team member Sergey Madaminov, who is pursuing his Ph.D. in computer science in the Computer Architecture at Stony Brook Lab at Stony Brook University. "However, this high-bandwidth memory is physically located close to the processor, limiting its capacity. To mitigate this limitation, we apply smart algorithms that split data into smaller chunks that can then fit into high-bandwidth memory and be sorted inside it. At the hackathon, our goal was to demonstrate our theoretical results—our algorithms speed up sorting—in practice. We ended up finding many weak places in our code and were able to fix them with the help of our mentor and experts from Intel, improving our initial code more than 40x. With this improvement, we expect to sort much larger datasets faster."

According to Lin, the hackathon was highly successful: all five teams improved the performance of their codes, achieving speedups ranging from 2x to 40x.

"It is expected that Intel Xeon Phi-based computing resources will continue operating until the next-generation exascale computers come online," says Lin. "It is important that users can make these systems work to their full potential for their specific applications."
