# A Mathematical and Data-Driven Approach to Intrusion Detection for High-Performance Computing

In this project, CRD researchers developed mathematical and statistical techniques to analyze the secure access and use of high-performance computer systems. The project was funded by the U.S. Department of Energy's Applied Mathematics Section.

The overall goal of the project was to develop mathematical and statistical methods to detect intrusions into high-performance computing systems. Our mathematical analysis is predicated on the fact that large HPC systems represent unique environments, quite unlike unspecialized systems or general Internet traffic. User behavior on HPC systems tends to be much more constrained (often driven by research deadlines and limited computational resources), and is generally limited to certain paradigms of computation (the set of codes performing the bulk of execution provides a rich source of information). In addition, the collaboration networks of users on an HPC system exhibit special characteristics that can be exploited to detect misuse or fraud. In this research work, we employed real system data, which we obtained in collaboration with staff in the NERSC Division of LBNL.

LBNL was the lead institution for this activity, which also involved the University of California at Davis (UC Davis) and the International Computer Science Institute (ICSI) at UC Berkeley through LBNL subcontracts. Over the course of the project, LBNL senior investigators included Deb Agarwal, David H. Bailey (PI), Scott Campbell, Juan Meza, Sean Peisert, Taghrid Samak, and Alexander Slepoy. The UC Davis senior investigators were Sean Peisert (joint appointment) and Matt Bishop. ICSI senior investigators included Vern Paxson and Robin Sommer. The research team was supported in their software development and data collection techniques by the staff at the National Energy Research Scientific Computing Center (NERSC). A variety of students and postdocs at the Berkeley Lab, UC Berkeley, and UC Davis were also involved, as were external collaborators (not funded by this grant) at Mt. Sinai School of Medicine and the University of San Francisco.

Recent work included:

- The adaptation and development of a "rule-ensemble" technique from the field of mathematical statistics that accurately and economically finds class labels and determines which parameter constraints are most useful for predicting these labels.
- The development and application of a technique of fingerprinting computation on HPC machines based on network theory and machine learning.
- An examination of Domain Name System (DNS) traffic using entropy analysis.
- Development of data sanitization techniques, so that real cybersecurity data can be shared with a wider community of researchers without compromising user privacy.
- Development of a script to generate a set of synthetic jobs, which then can be used to test job fingerprint algorithms.
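
To make the entropy analysis item in the list above more concrete, the following is a minimal sketch of how Shannon entropy can be computed over a window of observed values. It is an illustration only, not the project's implementation: the choice of queried names as the measured quantity, the function name `shannon_entropy`, and the sample windows are all assumptions introduced here.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Return the Shannon entropy, in bits, of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty window carries no information
    # H = sum over distinct values of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical windows of queried names (illustrative data, not real traffic).
window_a = ["a.example.org", "b.example.org", "a.example.org", "b.example.org"]
window_b = ["a.example.org"] * 4

print(shannon_entropy(window_a))  # two equally likely names
print(shannon_entropy(window_b))  # a single repeated name
```

In a sketch like this, a window whose entropy departs sharply from the values seen in earlier windows would be a candidate for closer inspection. Which features to measure and what counts as a sharp departure are choices the summary above does not specify.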