
AI-based Approach Speeds Diagnosis of I/O Performance Bottlenecks in HPC

March 13, 2024

By Kathy Kincade

A high-level overview of the AIIO approach, which applies AI and its interpretation technologies to diagnose the I/O performance bottlenecks of a job. (Credit: Bin Dong)


Researchers from Lawrence Berkeley National Laboratory (Berkeley Lab) have developed a novel AI-based method for diagnosing input/output (I/O) performance bottlenecks in high performance computing (HPC) that automatically identifies these bottlenecks at the job level and offers potential solutions. In modeling experiments run at the National Energy Research Scientific Computing Center (NERSC), this approach, dubbed “AIIO” (Artificial Intelligence for I/O), demonstrated that real applications and even unseen applications (those not used to train the model) can use these diagnostic results to improve I/O performance on HPC systems.

This work demonstrates how AI prediction-based performance functions, combined with new AI interpretation technologies, could be used to calculate the impact of various factors on I/O performance. It also lays the foundation for using AI to automatically identify and address I/O performance issues across multiple scientific applications and their communication, memory access, and computing capabilities.


Efficient I/O management is critical for minimizing data transfer times and optimizing overall system performance for large-scale scientific simulations and data-intensive applications. But the complex parallel I/O stack of HPC platforms, spanning both software and hardware, makes it difficult for end users to achieve optimal I/O performance and to understand the root causes of the I/O bottlenecks they encounter along the way. Users therefore need to be able to quickly identify the causes of I/O performance bottlenecks in HPC applications, because this information can significantly reduce I/O costs and shorten runtimes.

Manually diagnosing I/O bottlenecks has long been the norm, but this approach is tedious and error-prone and requires domain scientists to have deep knowledge of complex HPC storage systems. While some automated diagnostic methods do exist, they too have limitations; in particular, the analysis is confined to the platform or group level rather than the job (application) level, so the diagnostic results cannot be applied to an individual job.

These challenges prompted data management researchers in Berkeley Lab's Scientific Data Division to spend the last decade-plus investigating a variety of approaches to better understand I/O performance bottlenecks and address these bottlenecks automatically. Initially, the team tried different methods – including classical statistical methods, analytical models, data mining approaches, and a relatively new visualization tool (Drishti) – to obtain multiple I/O performance logs and use them to identify the root causes of poor performance. But ultimately they realized that AI tools might help identify the parameters that most affect I/O performance, and that using these technologies would enable them to focus on analyzing a single application’s I/O logs rather than multiple application logs. This approach is at the core of AIIO.


The Berkeley Lab team is not the first to apply AI techniques to I/O performance analysis, but – to the best of their knowledge – AIIO is the first to use AI and its interpretation technologies to automatically diagnose I/O performance bottlenecks at the job level. 

Through their research, they identified key factors that affect performance and diagnostic issues in this process, leading to the incorporation of both an AI prediction-based performance function and an AI interpretation-based diagnosis function in the AIIO software:

  • To build the performance function for a single job, AIIO uses multiple AI models depending on the job and domain; these currently include MLP (a multilayer perceptron neural network), XGBoost (a gradient boosting method used to build machine learning models), LightGBM (a machine learning model that uses a gradient boosting decision tree), CatBoost (which also uses gradient boosting), and TabNet (a deep neural network for tabular data).
  • AIIO’s multiple AI interpretation-based diagnosis functions include SHapley Additive exPlanations (SHAP), a game-theory-based diagnostic tool that unifies other diagnostic methods such as LIME, PDP, and DeepLIFT.
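To make the two functions above concrete, here is a minimal sketch of the idea, not AIIO's actual implementation. It assumes a hand-written toy function (`predict_bandwidth`) in place of a trained performance model, three hypothetical I/O counters whose names are illustrative rather than real Darshan fields, and a brute-force Shapley calculation rather than the SHAP library. The point is only to show how a prediction for one job can be split into per-factor contributions.

```python
from itertools import combinations
from math import factorial

# Hypothetical, normalized I/O counters for one job and for a reference
# ("no problem") baseline. These names are illustrative assumptions.
BASELINE = {"small_writes": 0.0, "seeks": 0.0, "collective_io": 0.0}
JOB = {"small_writes": 1.0, "seeks": 1.0, "collective_io": 1.0}


def predict_bandwidth(x):
    """Toy stand-in for a trained performance model (returns MB/s)."""
    bw = 1000.0
    bw -= 400.0 * x["small_writes"]
    bw -= 100.0 * x["seeks"]
    bw -= 200.0 * x["small_writes"] * x["seeks"]  # the two factors interact
    bw += 150.0 * x["collective_io"]
    return bw


def shapley_values(model, job, baseline):
    """Exact Shapley values: each feature's average marginal contribution
    to model(job) - model(baseline), taken over all feature subsets."""
    features = list(job)
    n = len(features)

    def value(coalition):
        # Features in the coalition take the job's value, others the baseline's.
        x = {f: (job[f] if f in coalition else baseline[f]) for f in features}
        return model(x)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


contributions = shapley_values(predict_bandwidth, JOB, BASELINE)
# The most negative contribution marks the likeliest bottleneck for this job.
bottleneck = min(contributions, key=contributions.get)
for name, phi in sorted(contributions.items(), key=lambda kv: kv[1]):
    print(f"{name:>14}: {phi:+.1f} MB/s")
print("suspected bottleneck:", bottleneck)
```

In this toy setup the contributions add up exactly to the gap between the job's predicted bandwidth and the baseline prediction, which is the property that lets a diagnosis be read off for a single job. Exact enumeration grows exponentially with the number of counters, so a practical tool would rely on an approximation such as those provided by SHAP.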

The team evaluated AIIO using synthetic and real application workloads from diverse domains and 40 months of logs from the Darshan I/O log database on NERSC’s Cori system. They also tested it on six different I/O patterns across three DOE applications currently in use: E2E, OpenPMD, and DASSA. Guided by AIIO’s diagnoses, the I/O performance improvements on these applications ranged from 1.8x on E2E and 2.1x on OpenPMD to 146x on DASSA.

The researchers are now investigating how AIIO could enable runtime systems – not just humans – to identify what is going on in an application’s I/O performance environment at any given time.

Research Lead

Bin Dong: Scientific Data Division, Berkeley Lab


Jean Luca Bez: Scientific Data Division, Berkeley Lab

Suren Byna: The Ohio State University; Scientific Data Division, Berkeley Lab


AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis. HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, August 2023, 155–167.


Exascale Computing Project (ExaIO sub-project), ASCR

User Facilities


About Berkeley Lab

Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 16 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.

DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit