Keywords: Data-driven decision making, Machine learning, Performance Analytics, High Performance Computing (HPC), Distributed Systems, Cycle Sharing Systems
Others: Scientific workflow applications, Fault-tolerance, Power-Aware Resilience.
Given that most applications pay for allocations on the HPC systems (aka supercomputers), their efficient utilization of the underlying systems is essential. Also, the difference in performance between a slow and a fast variant of an application could mean a more fine-grained scientific discovery attained faster. However, the complexity of many- and multi-core machines with heterogeneous architectures (e.g., dynamic branch prediction, prefetching, out-of-order scheduling, CPU and GPU units on the same node) makes it challenging for applications to scale up and for developers to pinpoint the cause of the performance bottlenecks.
Keeping that in mind, the goal of my research group--"Scalability"--is to help scientists efficiently utilize the power of HPC. My research group develops performance measurement tools, analysis methodologies, and novel visualizations for users (both application scientists and HPC management) to quickly identify the causes of the performance and scalability issues of applications running on these systems. We develop new machine learning techniques to understand how factors such as network topology, source code structure, and machine models impact the performance of applications as well as the utilization of the HPC systems under various power, performance, and resilience constraints.
Collaborators (partial list):
Khaled Ibrahim (LBNL)
Yang Liu (LBNL)
Xiaoye Sherry Li (LBNL)
Jae-Seung Yeom (LLNL)
Jayaraman J. Thiagarajan (LLNL)
Barry Rountree (LLNL)
Markus Schordan (LLNL)
Todd Gamblin (LLNL)
Aniruddha Marathe (LLNL)
Kathryn Mohror (LLNL)
Tapasya Patki (LLNL)
Kerstin Kleese Van Dam (BNL)
Line Pouchard (BNL)
Vivek Kale (BNL)
Bogdan Nicolae (ANL)
Rob Ross (ANL)
Filip Jagodzinski (WWU)
Moushumi Sharmin (WWU)
Shameem Ahmed (WWU)
Brian Hutchinson (WWU)
Tarek Ramadan (Graduate, Oracle)
Russell Hernandez Ruiz (undergrad, TxState)
Ethan Greene (undergrad, TxState)
Holland Schutte (undergrad, WWU)
Gian-Carlo DeFazio (grad, currently at LLNL)
Nathan Pinnow (grad, currently at LLNL)
Nicholas Majeske (grad, currently Ph.D. student at Indiana University)
Anna Zivkovik (undergrad, WWU)
Jack Stratton (undergrad, WWU)
David Smith (undergrad, WWU)
Tony Dinh (undergrad, WWU)
Trevor Marcus (undergrad, WWU)
Chloe Dawson (undergrad, WWU)
Alexis Ayala (undergrad, graduate student at WWU)
Quentin Jensen (undergrad, graduate student at WWU)
Philip Wu Liang (undergrad, WWU)
Forest Sweeney (undergrad, WWU)
Cody Pragner (undergrad, WWU)
Open-source Software Releases:
libNVCD: An easy-to-use performance measurement and analysis tool for NVIDIA-based GPUs. Latest public, open-source release of libNVCD was version 1.0 in September 2022.
Dashing: I developed an interpretable machine learning toolkit for HPC Performance Analysis. Latest public, open-source release of Dashing was version 1.0 on Aug 4, 2020.
GPTune: An online autotuning framework that suggests optimal execution parameters to users. We integrated Dashing's importance analysis and visualization capabilities into GPTune. https://gptune.lbl.gov/
SCR: Scalable Checkpoint/Restart for MPI. The project won an R&D 100 award in 2019. Latest public release of SCR was version 2.0.0 on March 28, 2019.
Gyan: Performance Measurement Tool for MPI implementations. Latest public, open-source release of Gyan was version 1.0 on May 7, 2014.
Open-source Data Releases:
On-node scaling data on HPC systems. 2019. DOI: 10.5281/zenodo.4315003.
Performance characterization data for AMReX applications developed by the DOE Exascale Computing Project (ECP). 2020. DOI: 10.5281/zenodo.3403037.
Data-driven decision making
This new direction of my research investigates the viability of leveraging automated data-driven analysis in decision making for various domains, including job scheduling, resource management, health care, transportation planning, and more. Several collaborative research opportunities are currently being pursued in my lab, and I am looking for students interested in optimization to join my team.
Proxy Application Development and Validation
Proxy applications are written to represent subsets of the performance behaviors of larger, more complex applications that often have distribution restrictions. In this research, we developed a systematic methodology for quantitatively comparing how well proxies match their parents.
My Ph.D. thesis built scalable checkpoint/restart systems for both high-throughput (Grid using Condor) and high-performance computing environments. A significant part of my thesis contributed to the Scalable Checkpoint/Restart (SCR) framework, which won an R&D 100 award in 2019.
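As a rough illustration of what a quantitative proxy-to-parent comparison can look like, the sketch below scores the similarity of two performance profiles with cosine similarity. This is not necessarily the methodology developed in this research; the metric names and the `cosine_similarity` helper are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(parent, proxy):
    """Cosine similarity over the metrics that both profiles report.

    `parent` and `proxy` are dicts mapping a metric name to a value.
    The metric names are hypothetical, not taken from any real tool.
    """
    keys = sorted(set(parent) & set(proxy))
    if not keys:
        raise ValueError("profiles share no metrics")
    a = [parent[k] for k in keys]
    b = [proxy[k] for k in keys]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero profile carries no directional information
    return dot / (norm_a * norm_b)


# Hypothetical, pre-normalized metric values for a parent and its proxy.
parent_profile = {"ipc": 0.8, "l2_miss_rate": 0.3, "mem_bw": 0.6}
proxy_profile = {"ipc": 0.7, "l2_miss_rate": 0.35, "mem_bw": 0.55}
score = cosine_similarity(parent_profile, proxy_profile)
```

A score near 1 suggests the proxy stresses the machine in roughly the same proportions as its parent. Metrics should be normalized to comparable scales first, otherwise the largest-valued counter dominates the result.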
Comparative Performance analysis
My research developed a principled approach for comparing performance between applications. This methodology was applied to proxy application validation, which is important for DOE's co-design centers. This project resulted in many publications and open-source software tools.
This project developed several algorithms for shifting power, in an I/O-aware manner, between applications in an HPC system to improve performance. Since applications use less power during their I/O phases, moving the unused power to applications or processes that are crunching numbers accelerates their computation.
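The idea can be sketched as a simple budget reallocation. The example below is a minimal illustration under stated assumptions, not one of the project's algorithms: the `Job` fields, the phase labels, and the even split among compute-phase jobs are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    phase: str         # "io" or "compute" (hypothetical labels)
    budget_w: float    # current power budget in watts
    io_floor_w: float  # assumed minimum power the job needs during I/O


def shift_power(jobs):
    """Move power headroom from I/O-phase jobs to compute-phase jobs.

    Returns a dict of job name -> new budget. The total budget is
    conserved: only power above each I/O-phase job's floor is moved.
    """
    new_budgets = {j.name: j.budget_w for j in jobs}
    compute_jobs = [j for j in jobs if j.phase == "compute"]
    if not compute_jobs:
        return new_budgets  # nobody can use the extra power

    reclaimed = 0.0
    for j in jobs:
        if j.phase == "io":
            surplus = max(0.0, j.budget_w - j.io_floor_w)
            new_budgets[j.name] = j.budget_w - surplus
            reclaimed += surplus

    # Assumption: split the reclaimed power evenly among compute jobs.
    share = reclaimed / len(compute_jobs)
    for j in compute_jobs:
        new_budgets[j.name] += share
    return new_budgets


jobs = [
    Job("writer", "io", 100.0, 60.0),
    Job("solver_a", "compute", 100.0, 60.0),
    Job("solver_b", "compute", 100.0, 60.0),
]
budgets = shift_power(jobs)
```

Here the I/O-phase job gives up the 40 W above its floor, and each compute-phase job receives 20 W. A real policy would also have to respect per-node power caps and return the power when the I/O phase ends.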
Cache Bottleneck Characterization
In this project, I built a robust low-level library for collecting Intel's PEBS counters in user space, and developed an analysis approach for correlating PEBS data with application-specific information, such as source line numbers, to characterize the cache-access bottlenecks of applications.
Performance-Aware Application Development
My group is developing machine learning models to predict the impact of a code change on application performance. This project aims to help application developers assess how their proposed code changes will impact an application's performance before the code is executed. We envision that performance information collected over time from nightly tests and provided as feedback to the model will significantly improve its accuracy.
MPI and MPI_T
In the past, I contributed to designing a fault-tolerant interface for MPI. I also developed performance measurement tools for assessing the performance of the MPI_T interface.
Reliable and Efficient Checkpointing for Cycle-sharing Systems
This research developed failure models for high-throughput systems, e.g., Grids, and developed a scalable data transfer and checkpoint storage mechanism to provide high fault tolerance in a volatile environment.