Keywords: Data-driven decision making, Machine learning, Performance Analysis, High Performance Computing (HPC), Distributed Systems, Cycle Sharing Systems
Others: Scientific workflow applications, Fault-tolerance, Power-Aware Resilience.
Given that most applications run on paid allocations on HPC systems (aka supercomputers), using the underlying systems efficiently is essential. Also, the difference in performance between a slow and a fast variant of an application could mean a more fine-grained scientific discovery attained faster. However, the complexity of many- and multi-core machines with heterogeneous architectures and sophisticated hardware features (e.g., dynamic branch prediction, prefetching, out-of-order scheduling, CPU and GPU units on the same node) makes it challenging for applications to scale up and for developers to pinpoint the causes of performance bottlenecks.
Keeping that in mind, the goal of my research group--"Scalability"--is to help scientists efficiently utilize the power of HPC. My research group develops performance measurement tools, analysis methodologies, and novel visualizations for users (both application scientists and HPC management) to quickly identify the causes of the performance and scalability issues of applications running on these systems. We develop new machine learning techniques to understand how factors such as network topology, source code structure, and machine models impact the performance of applications as well as the utilization of the HPC systems under various power, performance, and resilience constraints.
Chase Phelps (Graduate)
Tarek Ramadan (Graduate)
Elvis Fefey (Graduate)
Russell Hernandez Ruiz (Undergrad)
Ethan Greene (Undergrad)
Collaborators (partial list):
Aniruddha Marathe (LLNL)
Vivek Kale (BNL)
Evgeny Shcherbakov (AMD)
Jayaraman J. Thiagarajan (LLNL)
Khaled Ibrahim (LBL)
Barry Rountree (LLNL)
Markus Schordan (LLNL)
Todd Gamblin (LLNL)
Abhinav Bhatele (LLNL)
Kathryn Mohror (LLNL)
Tapasya Patki (LLNL)
Filip Jagodzinski (WWU)
Moushumi Sharmin (WWU)
Shameem Ahmed (WWU)
Brian Hutchinson (WWU)
Gian-Carlo DeFazio (grad, currently at LLNL)
Chloe Dawson (undergrad, WWU)
Alexis Ayala (undergrad, graduate student at WWU)
Quentin Jensen (undergrad, graduate student at WWU)
Nathan Pinnow (grad, Lawrence Livermore National Laboratory)
Nicholas Majeske (grad, Ph.D. student at Indiana University)
Data-driven decision making
This new direction of my research investigates the viability of leveraging automated data-driven analysis in decision making for various domains, including job scheduling, resource management, health care, transportation planning, and more. Several collaborative research opportunities are currently being pursued in my lab, and I am looking for students interested in optimization to join my team.
Proxy Application Development and Validation
Proxy applications are written to represent subsets of the performance behaviors of larger and more complex applications that often have distribution restrictions. In this research, we developed a systematic methodology for quantitatively comparing how well proxies match their parents.
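As a rough illustration of what a quantitative proxy-versus-parent comparison can look like, the sketch below scores two performance profiles with cosine similarity. This is not the methodology developed in this research. The metric names and values are hypothetical, and a real comparison would first normalize metrics that live on very different scales.

```python
import math

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity over the metrics that both profiles report."""
    shared = sorted(set(profile_a) & set(profile_b))
    dot = sum(profile_a[k] * profile_b[k] for k in shared)
    norm_a = math.sqrt(sum(profile_a[k] ** 2 for k in shared))
    norm_b = math.sqrt(sum(profile_b[k] ** 2 for k in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no usable signal in at least one profile
    return dot / (norm_a * norm_b)

# Hypothetical per-run metrics, already scaled to comparable ranges.
parent = {"ipc": 1.2, "l2_miss_rate": 0.08, "branch_miss_rate": 0.02}
proxy = {"ipc": 1.1, "l2_miss_rate": 0.10, "branch_miss_rate": 0.02}

score = cosine_similarity(parent, proxy)
```

A score close to 1 means the two profiles point in nearly the same direction in metric space. It says nothing about absolute magnitudes, which is one reason a single number like this is only a starting point.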
Comparative Performance Analysis
My research developed a principled approach for comparing performance between applications. This methodology was applied to proxy application validation, which is important for DOE's co-design centers. This project resulted in many publications and open-source software tools.
This project developed several algorithms for shifting power in an I/O-aware manner to other applications in an HPC system to improve performance. Since applications use less power during their I/O phases, moving that unused power to applications or processes that are crunching numbers accelerates their computation.
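The idea of I/O-aware power shifting described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not one of the project's algorithms: the `Job` record, the phase labels, and the `io_floor` parameter are hypothetical, and the freed power is simply split evenly among compute-phase jobs.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    phase: str        # hypothetical labels: "io" or "compute"
    power_cap: float  # current power cap in watts

def shift_power(jobs, io_floor):
    """Cap I/O-phase jobs at io_floor and give the freed watts to compute jobs.

    The total power budget is unchanged. Jobs in any other phase keep their cap.
    """
    io_jobs = [j for j in jobs if j.phase == "io"]
    compute_jobs = [j for j in jobs if j.phase == "compute"]
    if not compute_jobs:
        # Nobody can use the extra power, so leave every cap as it is.
        return {j.name: j.power_cap for j in jobs}

    freed = sum(max(j.power_cap - io_floor, 0.0) for j in io_jobs)
    bonus = freed / len(compute_jobs)

    new_caps = {}
    for j in jobs:
        if j.phase == "io":
            new_caps[j.name] = min(j.power_cap, io_floor)
        elif j.phase == "compute":
            new_caps[j.name] = j.power_cap + bonus
        else:
            new_caps[j.name] = j.power_cap
    return new_caps

jobs = [Job("checkpoint", "io", 100.0),
        Job("solver_a", "compute", 100.0),
        Job("solver_b", "compute", 100.0)]
caps = shift_power(jobs, io_floor=60.0)
```

In this example the checkpointing job drops to 60 W and each solver rises to 120 W, so the system-wide budget stays at 300 W. A practical policy would also have to respect per-node hardware limits and return the power when the I/O phase ends.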
Cache Bottleneck Characterization
In this project, I built a robust low-level library for collecting Intel's PEBS counters in user space, and developed an analysis approach for correlating PEBS data with application-specific information, such as source line numbers, to characterize the cache-access bottlenecks of applications.
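To make the correlation step concrete, the sketch below groups memory-access samples by source location and ranks the locations by total access latency. It assumes the samples have already been resolved to `(file, line, latency_cycles)` tuples. That record layout is a hypothetical simplification and does not reflect the library's actual output format.

```python
from collections import defaultdict

def hottest_lines(samples, top=3):
    """Rank source lines by total sampled access latency.

    samples: iterable of (file, line, latency_cycles) tuples.
    Returns a list of ((file, line), sample_count, mean_latency).
    """
    totals = defaultdict(lambda: [0, 0])  # location -> [count, total latency]
    for file, line, latency in samples:
        entry = totals[(file, line)]
        entry[0] += 1
        entry[1] += latency

    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(loc, count, total / count) for loc, (count, total) in ranked[:top]]

# Hypothetical samples for illustration only.
samples = [("stencil.c", 42, 200), ("stencil.c", 42, 300), ("init.c", 7, 50)]
report = hottest_lines(samples)
```

Lines that appear near the top with a high mean latency are candidates for cache-unfriendly access patterns, which is where a closer look at the data layout usually pays off.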
Performance-Aware Application Development
My group is developing machine learning models to predict the impact of a code change on application performance. This project aims to help application developers assess how their proposed code changes will affect the application's performance before the code is executed. We envision that performance information collected over time from nightly tests, provided as feedback to the model, will significantly improve its accuracy.