Keywords: Data-driven decision making, Machine learning, Performance Analysis, High Performance Computing (HPC), Distributed Systems, Cycle Sharing Systems

Others: Scientific workflow applications, Fault-tolerance, Power-Aware Resilience.

Given that most application teams pay for allocations on HPC systems (aka supercomputers), efficient use of the underlying systems is essential. Also, the difference in performance between a slow and a fast variant of an application could mean a more fine-grained scientific discovery attained faster. However, the complexity of many- and multi-core machines with heterogeneous architectures and features (e.g., dynamic branch prediction, prefetching, out-of-order scheduling, CPU and GPU units on the same node) makes it challenging for applications to scale up and for developers to pinpoint the causes of performance bottlenecks.


Keeping that in mind, the goal of my research group--"Scalability"--is to help scientists efficiently utilize the power of HPC. My research group develops performance measurement tools, analysis methodologies, and novel visualizations for users (both application scientists and HPC management) to quickly identify the causes of the performance and scalability issues of applications running on these systems. We develop new machine learning techniques to understand how factors such as network topology, source code structure, and machine models impact the performance of applications as well as the utilization of the HPC systems under various power, performance, and resilience constraints.



Chase Phelps (Graduate)

Tarek Ramadan (Graduate)

Elvis Fefey (Graduate)

Russell Hernandez Ruiz (Undergrad)

Ethan Greene (Undergrad)

Collaborators (partial list):
Aniruddha Marathe (LLNL)
Vivek Kale (BNL)
Evgeny Shcherbakov (AMD)
Jayaraman J. Thiagarajan (LLNL)
Khaled Ibrahim (LBL)

Barry Rountree (LLNL)

Markus Schordan (LLNL)

Todd Gamblin (LLNL)

Abhinav Bhatele (LLNL)

Kathryn Mohror (LLNL)

Tapasya Patki (LLNL)

Filip Jagodzinski (WWU)

Moushumi Sharmin (WWU)

Shameem Ahmed (WWU)

Brian Hutchinson (WWU)

Past Members:

Gian-Carlo DeFazio (grad, currently at LLNL)

Chloe Dawson (undergrad, WWU)

Alexis Ayala (undergrad, graduate student at WWU)

Quentin Jensen (undergrad, graduate student at WWU)

Nathan Pinnow (grad, Lawrence Livermore National Laboratory)

Nicholas Majeske (grad, Ph.D. student at Indiana University)

Research Projects

Data-driven decision making

This new direction of my research investigates the viability of leveraging automated data-driven analysis in decision making for various domains, including job scheduling, resource management, health care, transportation planning, and more. Several collaborative research opportunities are currently being pursued in my lab, and I am looking for students interested in optimization to join my team.

Proxy Application Development and Validation

Proxy applications are written to represent subsets of the performance behaviors of larger and more complex applications that often have distribution restrictions. In this research, we developed a systematic methodology for quantitatively comparing how well proxies match their parents.
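To give a flavor of what a quantitative proxy-to-parent comparison can look like, here is a minimal sketch. It is not the methodology developed in this research; it swaps in plain cosine similarity over per-run metric vectors as a stand-in. The metric names and values are hypothetical, and the sketch assumes the metrics are already normalized rates so that no single metric dominates the score.

```python
import math


def cosine_similarity(parent, proxy):
    """Cosine similarity over the metrics that both runs report.

    parent, proxy: dicts mapping a metric name to a normalized value.
    Returns a value in [0, 1] for non-negative metrics; 1.0 means the
    two metric profiles point in the same direction.
    """
    shared = sorted(set(parent) & set(proxy))
    dot = sum(parent[k] * proxy[k] for k in shared)
    norm_parent = math.sqrt(sum(parent[k] ** 2 for k in shared))
    norm_proxy = math.sqrt(sum(proxy[k] ** 2 for k in shared))
    if norm_parent == 0 or norm_proxy == 0:
        return 0.0  # no usable signal in at least one profile
    return dot / (norm_parent * norm_proxy)


# Hypothetical, made-up metric profiles for illustration only.
parent_profile = {"l1_miss_rate": 0.12, "branch_miss_rate": 0.03, "flops_fraction": 0.40}
proxy_profile = {"l1_miss_rate": 0.10, "branch_miss_rate": 0.04, "flops_fraction": 0.45}

score = cosine_similarity(parent_profile, proxy_profile)
print(f"proxy/parent similarity: {score:.3f}")
```

A single score like this hides which behaviors differ, so in practice one would also inspect the per-metric differences.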

Scalable Checkpoint/Restart

My Ph.D. thesis built scalable checkpoint/restart systems for both high-throughput (Grid using Condor) and high-performance computing environments. A significant part of my thesis contributed to the Scalable Checkpoint Restart framework that won the R&D 100 award in 2019.

Comparative Performance Analysis

My research developed a principled approach for comparing performance between applications. The main application of this methodology was proxy application validation, which is important for DOE's co-design centers. This project resulted in many publications and open-source software tools.

Power-Aware Resilience

This project developed several algorithms for shifting power in an I/O-aware manner among applications in an HPC system to improve performance. Since applications use less power during their I/O phases, moving the unused power to applications or processes that are crunching numbers accelerates their computation.
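The idea of I/O-aware power shifting can be sketched in a few lines. This is an illustrative toy under stated assumptions, not one of the project's algorithms: the `Job` record, the "io" and "compute" phase labels, the fixed `io_cap_watts` value, and the even split of freed power are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    phase: str        # "io" or "compute" (hypothetical phase labels)
    cap_watts: float  # current power cap for this job


def shift_power(jobs, io_cap_watts):
    """Return new power caps, keyed by job name.

    Jobs in an I/O phase are lowered to io_cap_watts, and the power this
    frees is split evenly among jobs in a compute phase. The total power
    budget across all jobs is unchanged.
    """
    compute_jobs = [j for j in jobs if j.phase == "compute"]
    if not compute_jobs:
        # Nobody can use the extra power, so leave every cap alone.
        return {j.name: j.cap_watts for j in jobs}

    caps = {}
    freed = 0.0
    for j in jobs:
        if j.phase == "io" and j.cap_watts > io_cap_watts:
            freed += j.cap_watts - io_cap_watts
            caps[j.name] = io_cap_watts
        else:
            caps[j.name] = j.cap_watts

    for j in compute_jobs:
        caps[j.name] += freed / len(compute_jobs)
    return caps


jobs = [
    Job("checkpoint_writer", "io", 100.0),
    Job("solver_a", "compute", 100.0),
    Job("solver_b", "compute", 100.0),
]
print(shift_power(jobs, io_cap_watts=60.0))
```

A real system would also have to respect per-node hardware limits on the caps and return the power when an I/O phase ends; both are left out of this sketch.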

Cache Bottleneck Characterization

In this project, I built a robust low-level library for collecting Intel's PEBS (Precise Event-Based Sampling) counter data in user space, and developed an analysis approach for correlating PEBS data with application-specific information, such as source line numbers, to characterize the cache-access bottlenecks of applications.
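As a rough illustration of the correlation step, the sketch below attributes sampled memory-access latencies to source lines. It does not use the library described above; the sample format (instruction address, latency in cycles) and the address-to-line table are hypothetical stand-ins for what a PEBS collector and debug information would provide.

```python
import bisect
from collections import defaultdict


def attribute_samples(samples, line_table):
    """Aggregate sampled latencies per source line.

    samples:    iterable of (instruction_address, latency_cycles)
    line_table: list of (range_start_address, file, line); each entry
                covers addresses up to the next entry's start.
    Returns a list of ((file, line), sample_count, total_latency),
    sorted by total latency, highest first.
    """
    entries = sorted(line_table)
    starts = [entry[0] for entry in entries]
    totals = defaultdict(lambda: [0, 0])  # (file, line) -> [count, latency]

    for address, latency in samples:
        idx = bisect.bisect_right(starts, address) - 1
        if idx < 0:
            continue  # address lies below the first known range
        _, file, line = entries[idx]
        totals[(file, line)][0] += 1
        totals[(file, line)][1] += latency

    report = [(key, count, total) for key, (count, total) in totals.items()]
    report.sort(key=lambda row: row[2], reverse=True)
    return report


# Made-up data for illustration only.
line_table = [(0x1000, "stencil.c", 10), (0x1010, "stencil.c", 11)]
samples = [(0x1004, 50), (0x1012, 200), (0x1018, 100), (0x0FFF, 5)]

for (file, line), count, total in attribute_samples(samples, line_table):
    print(f"{file}:{line}  samples={count}  total_latency={total}")
```

Sorting by total latency puts the lines that account for the most sampled memory-access time at the top, which is usually where a cache-bottleneck investigation would start.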

Performance-Aware Application Development

My group is developing machine learning models to predict the impact of a code change on application performance. This project aims to help application developers assess how their proposed code changes will impact that application's performance before the code is executed. We envision that the performance information collected over time from nightly tests provided as feedback to the model will significantly improve its accuracy.


In the past, I contributed to the design of a fault-tolerant interface for MPI. I also developed performance measurement tools for assessing the performance of the MPI_T interface.

Reliable and Efficient Checkpointing for Cycle-sharing Systems

This research developed failure models for high-throughput systems (e.g., Grids) and a scalable data transfer and checkpoint storage mechanism that provides high fault tolerance in a volatile environment.